java - GZIPInputStream closes prematurely when decompressing an HTTPInputStream
Author: Internet
Question
See the updated question in the Edit section below.
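I am trying to decompress large (~300 MB) GZIP files from Amazon S3 on the fly using GZIPInputStream, but it only outputs part of the file. If I download the file to the filesystem before decompressing it, however, GZIPInputStream decompresses the whole file.
How can I get GZIPInputStream to decompress the entire HTTPInputStream rather than just the first part of it?
What I have tried
See the update in the Edit section below.
I suspected an HTTP problem, except that no exceptions are ever thrown. GZIPInputStream returns a fairly consistent chunk of the file each time and, as far as I can tell, it always breaks on a WET record boundary, although the boundary it picks is different for each URL (which is very strange, since everything is treated as a binary stream and the WET records in the file are not parsed at all).
The closest question I could find is
GZIPInputStream is prematurely closed when reading from s3. The answer to that question was that some GZIP files are actually multiple appended GZIP files, and that GZIPInputStream does not handle that well. If that is the case, however, why would GZIPInputStream work fine on a local copy of the file?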
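For reference, the concatenated case mentioned in that answer is easy to reproduce against local data. The following is a minimal sketch and not code from the original post: the class name ConcatenatedGzipDemo and its helper methods are made up for illustration, and the two members are tiny in-memory strings rather than WET records.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class ConcatenatedGzipDemo {
    // Compress one string into a complete, standalone GZIP member.
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Append member b directly after member a, like `cat a.gz b.gz`.
    static byte[] concat(byte[] a, byte[] b) {
        byte[] joined = new byte[a.length + b.length];
        System.arraycopy(a, 0, joined, 0, a.length);
        System.arraycopy(b, 0, joined, a.length, b.length);
        return joined;
    }

    // Decompress everything GZIPInputStream is willing to return
    // when the source is an in-memory (local) stream.
    static String gunzip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buffer = new byte[1024];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] joined = concat(gzip("first record\n"), gzip("second record\n"));
        System.out.print(gunzip(joined));
    }
}
```

Read from a ByteArrayInputStream, both records should come back, which is consistent with the observation that a local copy of the file decompresses completely.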
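Demo code and output
Below is a piece of sample code that demonstrates the problem I am seeing. I have tested it with Java 1.8.0_72 and 1.8.0_112 on two different Linux computers on two different networks, with similar results. I expect the byte count from the decompressed HTTPInputStream to be identical to the byte count from the decompressed local copy of the file, but the decompressed HTTPInputStream is much smaller.
Output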
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 87894 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile0.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 1772936 bytes from HTTP->GZIP
Read 451171329 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile40.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 89217 bytes from HTTP->GZIP
Read 453183600 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile500.wet
Sample code
import java.net.*;
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.nio.channels.*;

public class GZIPTest {
    public static void main(String[] args) throws Exception {
        // Our three test files from CommonCrawl
        URL url0 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url40 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url500 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        /*
         * Test the URLs and display the results
         */
        test(url0, "testfile0.wet");
        System.out.println("------");
        test(url40, "testfile40.wet");
        System.out.println("------");
        test(url500, "testfile500.wet");
    }

    public static void test(URL url, String testGZFileName) throws Exception {
        System.out.println("Testing URL "+url.toString());
        // First directly wrap the HTTPInputStream with GZIPInputStream
        // and count the number of bytes we read
        // Go ahead and save the extracted stream to a file for further inspection
        System.out.println("Testing HTTP Input Stream direct to GZIPInputStream");
        int bytesFromGZIPDirect = 0;
        URLConnection urlConnection = url.openConnection();
        FileOutputStream directGZIPOutStream = new FileOutputStream("./"+testGZFileName);
        // FIRST TEST - Decompress from HTTPInputStream
        GZIPInputStream gzipishttp = new GZIPInputStream(urlConnection.getInputStream());
        byte[] buffer = new byte[1024];
        int bytesRead = -1;
        while ((bytesRead = gzipishttp.read(buffer, 0, 1024)) != -1) {
            bytesFromGZIPDirect += bytesRead;
            directGZIPOutStream.write(buffer, 0, bytesRead); // save to file for further inspection
        }
        gzipishttp.close();
        directGZIPOutStream.close();
        // Now save the GZIPed file locally
        System.out.println("Testing saving to file before decompression");
        int bytesFromGZIPFile = 0;
        ReadableByteChannel rbc = Channels.newChannel(url.openStream());
        FileOutputStream outputStream = new FileOutputStream("./test.wet.gz");
        outputStream.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
        outputStream.close();
        // SECOND TEST - decompress from FileInputStream
        GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream("./test.wet.gz"));
        buffer = new byte[1024];
        bytesRead = -1;
        while((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
            bytesFromGZIPFile += bytesRead;
        }
        gzipis.close();
        // The Results - these numbers should match but they don't
        System.out.println("Read "+bytesFromGZIPDirect+" bytes from HTTP->GZIP");
        System.out.println("Read "+bytesFromGZIPFile+" bytes from HTTP->file->GZIP");
        System.out.println("Output from HTTP->GZIP saved to file "+testGZFileName);
    }
}
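Edit
Closed the stream and the associated channel in the demo code, per @VGR's comment.
Update:
The problem does appear to be specific to the file. I pulled the Common Crawl WET archive down locally (wget), uncompressed it (gunzip 1.8), recompressed it (gzip 1.8), and re-uploaded it to S3, and the on-the-fly decompression then worked fine. You can see the test if you modify the sample code above to include the following lines: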
// Original file from CommonCrawl hosted on S3
URL originals3 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
// Recompressed file hosted on S3
URL rezippeds3 = new URL("https://s3-us-west-1.amazonaws.com/com.jeffharwell.commoncrawl.gziptestbucket/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
test(originals3, "originals3.txt");
test(rezippeds3, "rezippeds3.txt");
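The URL rezippeds3 points to the WET archive file that I downloaded, decompressed, recompressed, and then re-uploaded to S3. You will see the following output: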
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 7212400 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file originals3.txt
-----
Testing URL https://s3-us-west-1.amazonaws.com/com.jeffharwell.commoncrawl.gziptestbucket/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 448974935 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file rezippeds3.txt
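As you can see, once the file was recompressed I was able to stream it through GZIPInputStream and get the entire file. The original file still shows the usual premature end of decompression. When I downloaded and uploaded the WET file without recompressing it, I got the same incomplete streaming behavior, so it was definitely the recompression that fixed it. I also put both the original and the recompressed file on a traditional Apache web server and was able to replicate the results, so S3 does not seem to have anything to do with the problem.
So. I have a new question.
New question
Why would a FileInputStream behave differently from an HTTPInputStream when reading the same content? If it is exactly the same file, why does:
new GZIPInputStream(urlConnection.getInputStream());
behave any differently from
new GZIPInputStream(new FileInputStream("./test.wet.gz"));
?? Isn't an input stream just an input stream?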
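Answer:
Root cause discussion
It turns out that InputStreams can vary quite a bit. In particular, they differ in how they implement the .available() method. For example, ByteArrayInputStream.available() returns the number of bytes remaining in the InputStream. HTTPInputStream.available(), however, returns the number of bytes that can be read before a blocking IO request has to be made to refill the buffer. (See the Java Docs for more information.)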
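The difference can be seen without any network access. The sketch below is illustrative only: OpaqueStream is a hypothetical stand-in that, like InputStream itself, inherits the default available() (documented to always return 0). It is not how HTTPInputStream is implemented, and availableAfterOneRead is a helper name invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class AvailableSemanticsDemo {
    // A stream that still has data to give but does not override
    // available(), so it reports the InputStream default of 0.
    static final class OpaqueStream extends InputStream {
        private final byte[] data;
        private int pos = 0;

        OpaqueStream(byte[] data) {
            this.data = data;
        }

        @Override
        public int read() {
            return pos < data.length ? (data[pos++] & 0xff) : -1;
        }
    }

    // Consume one byte, then ask the stream how much it claims is available.
    static int availableAfterOneRead(InputStream in) {
        try {
            in.read();
            return in.available();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4, 5};
        System.out.println("ByteArrayInputStream: "
                + availableAfterOneRead(new ByteArrayInputStream(data)));
        System.out.println("OpaqueStream: "
                + availableAfterOneRead(new OpaqueStream(data)));
    }
}
```

Both streams still hold four unread bytes at that point, so a return value of 0 from available() cannot be taken to mean end of stream.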
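The problem is that GZIPInputStream uses the output of .available() to decide whether another GZIP file might follow in the InputStream once it has finished decompressing a complete GZIP file. Here is line 231 of the OpenJDK source file GZIPInputStream.java, in the method readTrailer(). In that line, n is the number of bytes left unconsumed in the inflater's input buffer after the current GZIP file has been decompressed.
if (this.in.available() > 0 || n > 26) {
If the HTTPInputStream read buffer empties right at the boundary between two concatenated GZIP files, GZIPInputStream calls .available(), which responds with 0 because it would have to go out to the network to refill the buffer. GZIPInputStream therefore considers the file complete and closes prematurely.
The Common Crawl .wet archives are hundreds of megabytes of small concatenated GZIP files, so eventually the HTTPInputStream buffer empties right at the end of one of the concatenated GZIP files and GZIPInputStream closes prematurely. This explains the problem demonstrated in the question.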
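The boundary condition can be simulated offline. The sketch below is a model built on stated assumptions, not the Common Crawl case itself: TrickleStream is a hypothetical stream that hands out one byte per read and always reports available() == 0, so its "buffer" is guaranteed to be empty when the first member ends. On the Java 8 builds discussed here it would be expected to stop after the first member. Other JDK versions may behave differently, so no particular count is claimed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class BoundaryStallDemo {
    // Hands out one byte per read and always reports available() == 0,
    // mimicking a network buffer that is empty after every read.
    static final class TrickleStream extends InputStream {
        private final byte[] data;
        private int pos = 0;

        TrickleStream(byte[] data) {
            this.data = data;
        }

        @Override
        public int read() {
            return pos < data.length ? (data[pos++] & 0xff) : -1;
        }

        @Override
        public int read(byte[] b, int off, int len) {
            if (len == 0) {
                return 0;
            }
            int value = read();
            if (value == -1) {
                return -1;
            }
            b[off] = (byte) value;
            return 1;
        }

        @Override
        public int available() {
            return 0;
        }
    }

    // Compress one string into a standalone GZIP member.
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Build a two-member GZIP file by appending one member to the other.
    static byte[] twoMembers(String first, String second) {
        byte[] a = gzip(first);
        byte[] b = gzip(second);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(a, 0, a.length);
        out.write(b, 0, b.length);
        return out.toByteArray();
    }

    // Count how many decompressed bytes GZIPInputStream returns before EOF.
    static int countDecompressedBytes(InputStream raw) {
        try (GZIPInputStream gz = new GZIPInputStream(raw)) {
            byte[] buffer = new byte[1024];
            int total = 0;
            int n;
            while ((n = gz.read(buffer)) != -1) {
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String first = "first record\n";
        String second = "second record\n";
        int got = countDecompressedBytes(new TrickleStream(twoMembers(first, second)));
        System.out.println("Decompressed " + got + " of "
                + (first.length() + second.length()) + " bytes");
    }
}
```

If the reported count equals the length of the first record only, the run has hit the same early exit described above, with no exception thrown.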
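Solution and workaround
This GIST contains a patch against jdk8u152-b00 revision 12039 and two jtreg tests that remove the (in my humble opinion) incorrect reliance on .available().
If you cannot patch the JDK, the workaround is to make sure that available() always returns > 0, which forces GZIPInputStream to always check for another GZIP file in the stream. Unfortunately HTTPInputStream is a private class, so you cannot subclass it directly. Instead, extend InputStream and wrap the HTTPInputStream. The code below demonstrates this workaround.
Demo code and output
The output here shows that when the HTTPInputStream is wrapped as described above, GZIPInputStream produces identical results whether it reads a concatenated GZIP from a file or directly from HTTP.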
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 448974935 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile0.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 451171329 bytes from HTTP->GZIP
Read 451171329 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile40.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 453183600 bytes from HTTP->GZIP
Read 453183600 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile500.wet
Here is the demo code from the question, modified to use the InputStream wrapper.
import java.net.*;
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.nio.channels.*;

public class GZIPTest {
    // Here is a wrapper class that wraps an InputStream
    // but always returns > 0 when .available() is called.
    // This will cause GZIPInputStream to always make another
    // call to the InputStream to check for an additional
    // concatenated GZIP file in the stream.
    public static class AvailableInputStream extends InputStream {
        private InputStream is;

        AvailableInputStream(InputStream inputstream) {
            is = inputstream;
        }

        @Override
        public int read() throws IOException {
            return is.read();
        }

        @Override
        public int read(byte[] b) throws IOException {
            return is.read(b);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return is.read(b, off, len);
        }

        @Override
        public void close() throws IOException {
            is.close();
        }

        @Override
        public int available() throws IOException {
            // Always say that we have 1 more byte in the
            // buffer, even when we don't
            int a = is.available();
            if (a == 0) {
                return 1;
            } else {
                return a;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Our three test files from CommonCrawl
        URL url0 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url40 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url500 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        /*
         * Test the URLs and display the results
         */
        test(url0, "testfile0.wet");
        System.out.println("------");
        test(url40, "testfile40.wet");
        System.out.println("------");
        test(url500, "testfile500.wet");
    }

    public static void test(URL url, String testGZFileName) throws Exception {
        System.out.println("Testing URL "+url.toString());
        // First wrap the HTTP InputStream with GZIPInputStream
        // and count the number of bytes we read
        // Go ahead and save the extracted stream to a file for further inspection
        System.out.println("Testing HTTP Input Stream direct to GZIPInputStream");
        int bytesFromGZIPDirect = 0;
        URLConnection urlConnection = url.openConnection();
        // Wrap the HTTPInputStream in our AvailableInputStream
        AvailableInputStream ais = new AvailableInputStream(urlConnection.getInputStream());
        GZIPInputStream gzipishttp = new GZIPInputStream(ais);
        FileOutputStream directGZIPOutStream = new FileOutputStream("./"+testGZFileName);
        int buffersize = 1024;
        byte[] buffer = new byte[buffersize];
        int bytesRead = -1;
        while ((bytesRead = gzipishttp.read(buffer, 0, buffersize)) != -1) {
            bytesFromGZIPDirect += bytesRead;
            directGZIPOutStream.write(buffer, 0, bytesRead); // save to file for further inspection
        }
        gzipishttp.close();
        directGZIPOutStream.close();
        // Save the GZIPed file locally
        System.out.println("Testing saving to file before decompression");
        ReadableByteChannel rbc = Channels.newChannel(url.openStream());
        FileOutputStream outputStream = new FileOutputStream("./test.wet.gz");
        outputStream.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
        // Close the file and the download channel before reading the file back
        outputStream.close();
        rbc.close();
        // Now decompress the local file and count the number of bytes
        int bytesFromGZIPFile = 0;
        GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream("./test.wet.gz"));
        buffer = new byte[1024];
        while((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
            bytesFromGZIPFile += bytesRead;
        }
        gzipis.close();
        // The Results
        System.out.println("Read "+bytesFromGZIPDirect+" bytes from HTTP->GZIP");
        System.out.println("Read "+bytesFromGZIPFile+" bytes from HTTP->file->GZIP");
        System.out.println("Output from HTTP->GZIP saved to file "+testGZFileName);
    }
}
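The same idea can be checked without touching the network. The sketch below is a variant and not the wrapper from the demo above: AlwaysAvailableInputStream is a hypothetical class that extends FilterInputStream (so every other method is delegated automatically), and stalling() is an invented helper that simulates a source whose buffer is empty after every read by returning at most one byte per call and reporting available() == 0.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipWorkaroundCheck {
    // Variant of the workaround: never report fewer than 1 available byte,
    // so GZIPInputStream always probes for a following GZIP member.
    static final class AlwaysAvailableInputStream extends FilterInputStream {
        AlwaysAvailableInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int available() throws IOException {
            return Math.max(1, super.available());
        }
    }

    // Simulated slow source: at most one byte per bulk read, available() == 0.
    static InputStream stalling(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public int read(byte[] b, int off, int len) throws IOException {
                return super.read(b, off, Math.min(len, 1));
            }

            @Override
            public int available() {
                return 0;
            }
        };
    }

    // Compress one string into a standalone GZIP member.
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Build a two-member GZIP file by appending one member to the other.
    static byte[] twoMembers(String first, String second) {
        byte[] a = gzip(first);
        byte[] b = gzip(second);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(a, 0, a.length);
        out.write(b, 0, b.length);
        return out.toByteArray();
    }

    // Count how many decompressed bytes GZIPInputStream returns before EOF.
    static int countDecompressedBytes(InputStream raw) {
        try (GZIPInputStream gz = new GZIPInputStream(raw)) {
            byte[] buffer = new byte[1024];
            int total = 0;
            int n;
            while ((n = gz.read(buffer)) != -1) {
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] joined = twoMembers("first record\n", "second record\n");
        int wrapped = countDecompressedBytes(new AlwaysAvailableInputStream(stalling(joined)));
        System.out.println("Wrapped stream decompressed " + wrapped + " bytes");
    }
}
```

Extending FilterInputStream keeps the wrapper to a single overridden method, at the cost of also forwarding mark/reset and skip, which the hand-written wrapper above does not forward.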
Tags: gzipinputstream, amazon-s3, java Source: https://codeday.me/bug/20191026/1936175.html