<!DOCTYPE html>
<html>
<head>
<title>MuPDF Progressive Loading</title>
<link rel="stylesheet" href="style.css" type="text/css">
</head>
<body>

<header>
<h1>MuPDF Progressive Loading</h1>
</header>

<article>

<p>
How to do progressive loading with MuPDF.

<h2>What is progressive loading?</h2>

<p>
The idea of progressive loading is that as you download a PDF file
into a browser, you can display the pages as they appear.

<p>
MuPDF can make use of two different mechanisms to achieve this. The
first relies on the file being "linearized"; the second relies on
the caller of MuPDF having fine control over the HTTP fetch and on
the server supporting byte-range fetches.

<p>
For optimum performance a file should be both linearized and be
available over a byte-range supporting link, but benefits can still
be had with either one of these alone.

<h2>Progressive download using "linearized" files</h2>

<p>
Adobe defines "linearized" PDFs as being ones that have both a
specific layout of objects and a small amount of extra
information to help avoid seeking within a file. The stated aim
is to deliver the first page of a document in advance of the whole
document downloading, whereupon subsequent pages will become
available. Adobe also refers to these as "Optimized for fast web
view" or "Web Optimized".

<p>
In fact, the standard outlines (poorly) a mechanism by which 'hints'
can be included that enable the subsequent pages to be found within
the file too. Unfortunately this is very poorly supported by
many tools, and so the hints have to be treated with suspicion.

<p>
MuPDF will attempt to use hints if they are available, but will also
use a linear search of the file to discover pages if not. This means
that the first page will be displayed quickly, and then subsequent
ones will appear with 'incomplete' renderings that improve over time
as more and more resources are gradually delivered.

<p>
Essentially the file starts with a slightly modified header, and the
first object in the file is a special one (the linearization object)
that a) indicates that the file is linearized, and b) gives some
useful information (like the number of pages in the file etc).

<p>
This object is then followed by all the objects required for the
first page, then the "hint stream", then sets of objects for each
subsequent page in turn, then shared objects required for those
pages, then various other random things.

<p>
[Yes, really. While page 1 is sent with all the objects that it
uses, shared or otherwise, subsequent pages do not get shared
resources until after all the unshared page objects have been
sent.]
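
<p>
As a rough illustration of how the linearization object can be used, the
sketch below pulls integer entries out of the dictionary text and compares
the declared file length with the length reported by the transport. The
key names (/Linearized, /L for the file length, /N for the page count) come
from the PDF specification; the helper names demo_dict_int and
demo_is_linear are hypothetical and are not part of MuPDF, and the parsing
is deliberately naive compared with a real PDF lexer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of MuPDF: find "/Key <integer>" in the text
 * of a linearization dictionary. Returns 1 and stores the value on success. */
static int demo_dict_int(const char *dict, const char *key, long *out)
{
	size_t klen;
	const char *p = dict;

	assert(dict != NULL && key != NULL && out != NULL);
	klen = strlen(key);
	while ((p = strstr(p, key)) != NULL)
	{
		char next = p[klen];
		/* Require whitespace after the key so "/L" does not match "/Linearized". */
		if (next == ' ' || next == '\t' || next == '\r' || next == '\n')
			return sscanf(p + klen, "%ld", out) == 1;
		p += klen;
	}
	return 0;
}

/* Returns 1 if the dictionary claims linearization and its /L entry matches
 * the length reported by the transport (for example Content-Length). */
static int demo_is_linear(const char *dict, long stream_length)
{
	long version, length;

	if (!demo_dict_int(dict, "/Linearized", &version))
		return 0;
	if (!demo_dict_int(dict, "/L", &length))
		return 0;
	return length == stream_length;
}
```

<p>
A length mismatch is treated here as "not linearized", on the assumption
that the file was modified (for instance by an incremental save) after the
linearization object was written.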

<h2>The Hint Stream</h2>

<p>
Adobe intended the hint stream to facilitate the display
of subsequent pages, but Adobe itself has never used it. Consequently you
can't trust people to write it properly - indeed Adobe outputs
something that doesn't quite conform to the spec.

<p>
Consequently very few people actually use it. MuPDF will use it
after sanity checking the values, and should cope with illegal/
incorrect values.
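
<p>
The kind of sanity check meant here can be sketched as follows. This is
only an illustration under simplifying assumptions: the demo_page_hint
structure (one predicted offset and length per page) and the
demo_hints_usable function are hypothetical, and the real hint tables
carry considerably more information than this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of unpacked hint data: one predicted byte
 * offset and length per page. */
typedef struct
{
	long offset;
	long length;
} demo_page_hint;

/* Returns 1 only if every predicted range lies inside the file; a caller
 * would ignore the hints entirely when this returns 0. */
static int demo_hints_usable(const demo_page_hint *hints, size_t count, long file_length)
{
	size_t i;

	assert(hints != NULL || count == 0);
	if (file_length <= 0)
		return 0;
	for (i = 0; i < count; i++)
	{
		if (hints[i].offset < 0 || hints[i].length <= 0)
			return 0;
		/* Written as a subtraction to avoid overflow in offset + length. */
		if (hints[i].length > file_length - hints[i].offset)
			return 0;
	}
	return 1;
}
```

<p>
Rejecting the whole table on the first bad entry is a design choice for
this sketch: a single out-of-range value suggests the writer got the table
wrong, so none of the other entries are trusted either.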

<h2>So how does MuPDF handle progressive loading?</h2>

<p>
MuPDF has made various extensions to its mechanisms for handling
progressive loading.

<ul>
	<li> Progressive streams

		<p>
		At its lowest level MuPDF reads file data from a fz_stream,
		using the fz_open_document_with_stream call. (fz_open_document
		is implemented by calling this). We have extended the fz_stream
		slightly, giving the system a way to ask for meta information
		(or perform meta operations) on a stream.

		<p>
		Using this mechanism MuPDF can query:

		<ul>
			<li> whether a stream is progressive or not (i.e. whether the
				entire stream is accessible immediately)
			<li> what the length of a stream should ultimately be (which an
				HTTP fetcher should know from the Content-Length header).
		</ul>

		<p>
		With this information MuPDF can decide whether to use its normal
		object reading code, or whether to make use of the linearization
		object. Knowing the length enables us to check it against the length
		value given in the linearization object - if these differ, the
		assumption is that an incremental save has taken place, and thus the
		file is no longer linearized.

		<p>
		When data is pulled from a progressive stream, if we attempt to
		read data that is not currently available, the stream should
		throw a FZ_ERROR_TRYLATER error. This particular error code
		will be interpreted by the caller as an indication that it
		should retry the parsing of the current objects at a later time.

		<p>
		When a MuPDF call is made on a progressive stream, such as
		fz_open_document_with_stream, or fz_load_page, the caller should
		be prepared to handle a FZ_ERROR_TRYLATER error as meaning that
		more data is required before it can continue. No indication is
		directly given as to exactly how much more data is required, but
		as the caller will be implementing the progressive fz_stream
		that it has passed into MuPDF to start with, it can reasonably
		be expected to figure out an estimate for itself.

	<li> Cookie

		<p>
		Once a page has been loaded, if its contents are to be 'run'
		as normal (using e.g. fz_run_page) any error (such as failing
		to read a font, or an image, or even a content stream belonging
		to the page) will result in a rendering that aborts with an
		FZ_ERROR_TRYLATER error. The caller can catch this and display
		a placeholder instead.

		<p>
		If each page's data was entirely self-contained and sent in
		sequence this would perhaps be acceptable, with each page
		appearing one after the other. Unfortunately, the linearization
		procedure as laid down by Adobe does NOT do this: objects shared
		between multiple pages (other than the first) are not sent with
		the pages themselves, but rather AFTER all the pages have been
		sent.

		<p>
		This means that a document that has a title page, then contents
		that share a font used on pages 2 onwards, will not be able to
		correctly display page 2 until after the font has arrived in
		the file, which will not be until all the page data has been
		sent.

		<p>
		To mitigate this, MuPDF provides a way whereby callers
		can indicate that they are prepared to accept an 'incomplete'
		rendering of the file (perhaps with missing images, or with
		substitute fonts).

		<p>
		Callers prepared to tolerate such renderings should set the
		'incomplete_ok' flag in the cookie, then call fz_run_page etc
		as normal. If a FZ_ERROR_TRYLATER error is thrown at any point
		during the page rendering, the error will be swallowed, the
		'incomplete' field in the cookie will become non-zero and
		rendering will continue. When control returns to the caller
		the caller can check the value of the 'incomplete' field and
		know that the rendering it received is not authoritative.

</ul>
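
<p>
The cookie behaviour described above can be modelled without MuPDF at all.
The sketch below is a self-contained simulation, not the library's code:
demo_cookie mirrors only the two fields named in the text, and
demo_run_page, DEMO_OK and DEMO_TRYLATER are hypothetical stand-ins for a
page run and its error codes.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_OK = 0, DEMO_TRYLATER = 1 };

/* Hypothetical stand-in for the two cookie fields described in the text. */
typedef struct
{
	int incomplete_ok; /* set by the caller: partial renderings are acceptable */
	int incomplete;    /* set by the renderer: number of resources skipped */
} demo_cookie;

/* Pretend to render a page that needs 'count' resources. available[i] is
 * non-zero once resource i has been downloaded; *drawn counts the resources
 * that were actually rendered. */
static int demo_run_page(const int *available, size_t count, demo_cookie *cookie, int *drawn)
{
	size_t i;

	assert(available != NULL && drawn != NULL);
	*drawn = 0;
	for (i = 0; i < count; i++)
	{
		if (available[i])
		{
			(*drawn)++;
			continue;
		}
		/* Missing data: abort unless the caller opted in to partial output. */
		if (cookie == NULL || !cookie->incomplete_ok)
			return DEMO_TRYLATER;
		/* Swallow the error, note it in the cookie, and carry on. */
		cookie->incomplete++;
	}
	return DEMO_OK;
}
```

<p>
A caller using this pattern would redraw the page later if the
'incomplete' field is non-zero after the run, since more data may have
arrived in the meantime.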

<h2>Progressive loading using byte range requests</h2>

<p>
If the caller has control over the HTTP fetch, then it is possible
to use byte range requests to fetch the document 'out of order'.
This enables non-linearized files to be progressively displayed as
they download, and fetches complete renderings of pages earlier than
would otherwise be the case. This process requires no changes within
MuPDF itself, but rather in the way the progressive stream learns
from the attempts MuPDF makes to fetch data.
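
<p>
For reference, a byte range request is an ordinary HTTP request carrying a
header of the form "Range: bytes=first-last", with both offsets inclusive.
The helper below is a minimal sketch of building that header for one
fixed-size chunk; demo_range_header and its parameters are hypothetical,
and a negative file_length is used here to mean "length not yet known".

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: format the Range header for chunk number 'chunk'.
 * Returns 1 on success, 0 if the chunk lies beyond the end of the file or
 * the buffer is too small. */
static int demo_range_header(char *buf, size_t size, long chunk, long chunk_size, long file_length)
{
	long first, last;
	int n;

	assert(buf != NULL && chunk >= 0 && chunk_size > 0);
	first = chunk * chunk_size;
	if (file_length >= 0 && first >= file_length)
		return 0;
	last = first + chunk_size - 1;
	/* Clamp the final chunk so we never ask for bytes past the end. */
	if (file_length >= 0 && last >= file_length)
		last = file_length - 1;
	n = snprintf(buf, size, "Range: bytes=%ld-%ld", first, last);
	return n > 0 && (size_t)n < size;
}
```

<p>
With a chunk size of 4096 and an unknown length, chunk 0 yields
"Range: bytes=0-4095", which corresponds to an initial fetch of the first
4k of the file.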

<p>
Consider for example, an attempt to fetch a hypothetical file from
a server.

<ul>
	<li><p>
	The initial HTTP request for the document is sent with a "Range:"
	header to pull down the first (say) 4k of the file.

	<li><p>
	As soon as we get the header in from this initial request, we can
	respond to meta stream operations to give the length, and whether
	byte-range requests are accepted.

	<ul>
		<li><p>
		If the header indicates that byte ranges are acceptable the
		stream proceeds to go into a loop fetching chunks of the file
		at a time (not necessarily in-order). Otherwise the server
		will ignore the Range: header, and just serve the whole file.

		<li><p>
		If the header indicates a content-length, the stream returns
		that.
	</ul>

	<li><p>
	MuPDF can then decide how to proceed based upon these flags and
	whether the file is linearized or not. (If the file contains a
	linearization object, and the content length matches, then the file
	is considered to be linear, otherwise it is not).

	<p>
	If the file is linear:

	<ul>
		<li><p>
		We proceed to read objects out of the file as it downloads.
		This will provide us the first page and all its resources. It
		will also enable us to read the hint streams (if present).

		<li><p>
		Once we have read the hint streams, we unpack (and sanity
		check) them to give us a map of where in the file each object
		is predicted to live, and which objects are required for each
		page. If any of these values are out of range, we treat the
		file as if there were no hint streams.

		<li><p>
		If we have hints, any attempt to load a subsequent page will
		cause MuPDF to attempt to read exactly the objects required.
		This will cause a sequence of seeks in the fz_stream followed
		by reads. If the stream does not have the data to satisfy that
		request yet, the stream code should remember the location that
		was requested (and fetch that block in the background so that
		future retries will succeed) and should raise an
		FZ_ERROR_TRYLATER error.

		<p>
		[Typically therefore when we jump to a page in a linear file
		on a byte request capable link, we will quickly see a rough
		rendering, which will improve fairly fast as images and fonts
		arrive.]

		<li><p>
		Regardless of whether we have hints or byte requests, on every
		fz_load_page call MuPDF will attempt to process more of the
		stream (that is assumed to be being downloaded in the
		background). As linearized files are guaranteed to have pages
		in order, pages will gradually become available. In the absence
		of byte requests and hints however, we have no way of getting
		resources early, so the renderings for these pages will remain
		incomplete until much more of the file has arrived.

		<p>
		[Typically therefore when we jump to a page in a linear file
		on a non byte request capable link, we will see a rough
		rendering for that page as soon as data arrives for it (which
		will typically take much longer than would be the case with
		byte range capable downloads), and that will improve much more
		slowly as images and fonts may not appear until almost the
		whole file has arrived.]

		<li><p>
		When the whole file has arrived, then we will attempt to read
		the outlines for the file.
	</ul>

	<p>
	For a non-linearized PDF on a byte request capable stream:

	<ul>
		<li><p>
		MuPDF will immediately seek to the end of the file to attempt
		to read the trailer. This will fail with a FZ_ERROR_TRYLATER
		due to the data not being here yet, but the stream code should
		remember that this data is required and it should be prioritized
		in the background fetch process.

		<li><p>
		Repeated attempts to open the stream should therefore eventually
		succeed. As MuPDF jumps through the file trying to read first
		the xrefs, then the page tree objects, then the page contents
		themselves etc, the background fetching process will be driven
		by the attempts to read the file in the foreground.

		<p>
		[Typically therefore the opening of a non-linearized file will
		be slower than a linearized one, as the xrefs/page trees for a
		non-linear file can be 20%+ of the file data. Once past this
		initial point however, pages and data can be pulled from the
		file almost as fast as with a linearized file.]
	</ul>

	<p>
	For a non-linearized PDF on a non-byte request capable stream:

	<ul>
		<li><p>
		MuPDF will immediately seek to the end of the file to attempt
		to read the trailer. This will fail with a FZ_ERROR_TRYLATER
		due to the data not being here yet. Subsequent retries will
		continue to fail until the whole file has arrived, whereupon
		the whole file will be instantly available.

		<p>
		[This is the worst case situation - nothing at all can be
		displayed until the entire file has downloaded.]
	</ul>

	<p>
	A typical structure for a fetcher process (see curl-stream.c in
	mupdf-curl as an example) might therefore look like this:

	<ul>
		<li><p>
		We consider the file as an (initially empty) buffer which we are
		filling by making requests. In order to ensure that we make
		maximum use of our download link, we ensure that whenever
		one request finishes, we immediately launch another. Further, to
		avoid the overheads for the request/response headers being too
		large, we may want to divide the file into 'chunks', perhaps 4k or 32k
		in size.

		<li><p>
		We can then have a receiver process that sits there in a loop
		requesting chunks to fill this buffer. In the absence of
		any other impetus the receiver should request the next 'chunk'
		of data from the file that it does not yet have, following the last
		fill point. Initially we start the fill point at the beginning of
		the file, but this will move around based on the requests made of
		the progressive stream.

		<li><p>
		Whenever MuPDF attempts to read from the stream, we check to see if
		we have data for this area of the file already. If we do, we can
		return it. If not, we remember this as the next "fill point" for our
		receiver process and throw a FZ_ERROR_TRYLATER error.
	</ul>

</ul>
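
<p>
The fetcher structure outlined above can be sketched in a few functions.
This is a self-contained model under stated assumptions, not the code from
curl-stream.c: all the demo_ names are hypothetical, the network is
simulated by copying from an in-memory copy of the file, and the chunk size
is set to 4 bytes purely to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { DEMO_CHUNK = 4, DEMO_MAX_CHUNKS = 64 };
enum { DEMO_OK = 0, DEMO_TRYLATER = 1 };

/* Hypothetical fetcher state: the file as a buffer filled chunk by chunk. */
typedef struct
{
	unsigned char *data;        /* buffer large enough for the whole file */
	size_t length;              /* total length, e.g. from Content-Length */
	char have[DEMO_MAX_CHUNKS]; /* have[i] is non-zero once chunk i arrived */
	size_t fill_point;          /* chunk the receiver should consider next */
} demo_fetcher;

static size_t demo_chunk_count(const demo_fetcher *f)
{
	return (f->length + DEMO_CHUNK - 1) / DEMO_CHUNK;
}

/* Receiver side: choose the next missing chunk at or after the fill point,
 * wrapping round to the start. Returns the chunk count if nothing is missing. */
static size_t demo_next_chunk(const demo_fetcher *f)
{
	size_t n = demo_chunk_count(f), i;

	for (i = 0; i < n; i++)
	{
		size_t c = (f->fill_point + i) % n;
		if (!f->have[c])
			return c;
	}
	return n;
}

/* Receiver side: store one chunk. 'src' points at the full file contents
 * and stands in for the server's response to a range request. */
static void demo_deliver(demo_fetcher *f, size_t chunk, const unsigned char *src)
{
	size_t off, len;

	assert(chunk < demo_chunk_count(f));
	off = chunk * DEMO_CHUNK;
	len = f->length - off < DEMO_CHUNK ? f->length - off : DEMO_CHUNK;
	memcpy(f->data + off, src + off, len);
	f->have[chunk] = 1;
	f->fill_point = chunk + 1;
}

/* Stream side: copy out data if every chunk covering it is present; otherwise
 * record the first missing chunk as the new fill point and report "try later". */
static int demo_read(demo_fetcher *f, size_t off, unsigned char *out, size_t len)
{
	size_t c;

	assert(len > 0 && off + len <= f->length);
	for (c = off / DEMO_CHUNK; c <= (off + len - 1) / DEMO_CHUNK; c++)
	{
		if (!f->have[c])
		{
			f->fill_point = c;
			return DEMO_TRYLATER;
		}
	}
	memcpy(out, f->data + off, len);
	return DEMO_OK;
}
```

<p>
In this model a failed read near the end of the file moves the fill point
there, so the receiver fetches that chunk next and a retry of the same read
succeeds; the receiver then wraps round and resumes filling from the start
of the file.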

</article>

<footer>
<a href="http://www.artifex.com/"><img src="artifex-logo.png" align="right"></a>
Copyright &copy; 2006-2018 Artifex Software Inc.
</footer>

</body>
</html>