Extract data from non-HTML documents
The Crawler can extract data from files like PDFs and DOCs. To do this, it uses Apache Tika to extract a document’s content and transform it into a basic HTML file.
Limitations
Because it’s difficult to translate non-HTML documents into HTML, there are limitations:
- A PDF can break if it’s exported with an unknown font.
- The transformed HTML has little semantic value: headings, paragraphs, and lists in the original document might not be marked in the HTML. This makes good relevancy hard to achieve.
- Document indexing is slower than classic HTML indexing.
- Language detection isn’t available.
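A broken PDF or a document with little extractable text can produce near-empty HTML. As a minimal sketch, assuming `$` exposes a Cheerio-style API and that returning an empty array from a recordExtractor indexes nothing, a guard could skip such documents. The `hasUsableText` and `extractIfUsable` names and the 20-character threshold are hypothetical.

```javascript
// Hypothetical helper: true when the transformed HTML has enough visible text.
// minLength is an arbitrary threshold, not a Crawler setting.
function hasUsableText($, minLength = 20) {
  const text = $('body').text().replace(/\s+/g, ' ').trim();
  return text.length >= minLength;
}

// Sketch of a recordExtractor that skips documents with almost no text.
function extractIfUsable({ url, $, fileType }) {
  if (!hasUsableText($)) {
    return []; // assumption: an empty array means no records are indexed
  }
  return [{ objectID: String(url), type: fileType }];
}
```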
Enable document extraction
To enable document extraction, add the fileTypesToMatch parameter to at least one of your crawler’s actions.
The available fileTypesToMatch are:
- html for web pages. This is the default when no fileTypesToMatch parameter is present
- pdf for PDF documents
- doc, xls, and ppt for Microsoft Office documents
- odt, ods, and odp for Open documents
- email for electronic mail documents
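A typo in fileTypesToMatch (for example, using an extension such as docx instead of the doc file type) is easy to miss. The following sketch checks a list against the values named above. The `findUnsupportedFileTypes` helper is hypothetical and isn't part of the Crawler.

```javascript
// File types listed in this section.
const SUPPORTED_FILE_TYPES = [
  'html', 'pdf', 'doc', 'xls', 'ppt', 'odt', 'ods', 'odp', 'email',
];

// Hypothetical helper: returns the entries that aren't valid fileTypesToMatch values.
// Defaults to ['html'], the documented default when the parameter is absent.
function findUnsupportedFileTypes(fileTypesToMatch = ['html']) {
  return fileTypesToMatch.filter(
    (type) => !SUPPORTED_FILE_TYPES.includes(type)
  );
}

console.log(findUnsupportedFileTypes(['pdf', 'docx', 'msg'])); // [ 'docx', 'msg' ]
```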
The document’s transformed HTML is stored in the recordExtractor $ parameter.
The file type is stored in the recordExtractor fileType parameter.
({
  [...]
  actions: [
    {
      indexName: 'crawler-example',
      pathsToMatch: ['https://www.example.com/**'],
      fileTypesToMatch: ['pdf', 'doc'],
      recordExtractor: ({ url, $, fileType }) => {
        // Log the transformed HTML and the detected file type
        console.log($.html(), fileType);
      }
    },
  ]
});
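The configuration above only logs the transformed HTML. As a rough sketch of what a recordExtractor might return instead, the function below builds one record per document. It assumes `$` exposes a Cheerio-style API (as the `$.html()` call suggests). The function name and the attribute names `title`, `content`, and `type` are hypothetical choices.

```javascript
// Sketch: one record per document, keeping the file type as an attribute.
function extractDocumentRecord({ url, $, fileType }) {
  const address = String(url); // works for a URL object or a plain string
  const title = $('title').text().trim();
  const content = $('body').text().replace(/\s+/g, ' ').trim();

  return [
    {
      objectID: address,
      // Some documents have an empty <title>, so fall back to the URL.
      title: title || address,
      content,
      type: fileType,
    },
  ];
}
```

Keeping `type` on the record makes it possible to filter or facet search results by file type later on.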
Sample crawler configuration
For an example configuration for document extraction,
see config.documents.js on GitHub.
Supported file types
PDF document
- Associated extension: .pdf
- fileTypesToMatch: pdf
For example, in this .pdf file, Tika exposes the following HTML, which the crawler then passes to your recordExtractor.

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2018-07-17T13:35:40Z"/>
<meta name="pdf:PDFVersion" content="1.3"/>
<meta name="pdf:docinfo:title" content="test-docx-file.pages"/>
<meta name="xmp:CreatorTool" content="Pages"/>
<meta name="access_permission:modify_annotations" content="true"/>
<meta name="access_permission:can_print_degraded" content="true"/>
<meta name="dcterms:created" content="2018-07-17T13:35:40Z"/>
<meta name="Last-Modified" content="2018-07-17T13:35:40Z"/>
<meta name="dcterms:modified" content="2018-07-17T13:35:40Z"/>
<meta name="dc:format" content="application/pdf; version=1.3"/>
<meta name="Last-Save-Date" content="2018-07-17T13:35:40Z"/>
<meta name="pdf:docinfo:creator_tool" content="Pages"/>
<meta name="access_permission:fill_in_form" content="true"/>
<meta name="pdf:docinfo:modified" content="2018-07-17T13:35:40Z"/>
<meta name="meta:save-date" content="2018-07-17T13:35:40Z"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="dc:title" content="test-docx-file.pages"/>
<meta name="modified" content="2018-07-17T13:35:40Z"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="meta:creation-date" content="2018-07-17T13:35:40Z"/>
<meta name="created" content="Tue Jul 17 13:35:40 UTC 2018"/>
<meta name="access_permission:extract_for_accessibility" content="true"/>
<meta name="access_permission:assemble_document" content="true"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="Creation-Date" content="2018-07-17T13:35:40Z"/>
<meta name="access_permission:extract_content" content="true"/>
<meta name="access_permission:can_print" content="true"/>
<meta name="producer" content="Mac OS X 10.13.5 Quartz PDFContext"/>
<meta name="access_permission:can_modify" content="true"/>
<meta name="pdf:docinfo:producer" content="Mac OS X 10.13.5 Quartz PDFContext"/>
<meta name="pdf:docinfo:created" content="2018-07-17T13:35:40Z"/>
<title>test-docx-file.pages</title>
</head>
<body>
<div class="page">
<p/>
<p>Test PDF file content</p>
<p/>
</div>
</body>
</html>
The metadata presented here isn’t guaranteed to appear on every document.
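Because metadata can be missing, it helps to read each `<meta>` value with a fallback. The sketch below assumes `$` exposes a Cheerio-style API. The `metaContent` helper, the function name, and the record attribute names are hypothetical. The meta names come from the sample output above.

```javascript
// Hypothetical helper: read a Tika <meta> value, or return a fallback when
// the tag is missing or empty.
function metaContent($, name, fallback = null) {
  const value = $(`meta[name="${name}"]`).attr('content');
  return value === undefined || value === '' ? fallback : value;
}

// Sketch: build a PDF record that tolerates missing metadata.
function extractPdfRecord({ url, $, fileType }) {
  return [
    {
      objectID: String(url),
      title: metaContent($, 'dc:title', $('title').text().trim()),
      createdAt: metaContent($, 'dcterms:created'),
      pageCount: Number(metaContent($, 'xmpTPg:NPages', 0)),
      type: fileType,
    },
  ];
}
```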
Word document
- Associated extensions: .doc, .docx
- fileTypesToMatch: doc
For example, in this .doc file, Tika exposes the following HTML, which the crawler then passes to its recordExtractor.

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser"/>
<meta name="Content-Type" content="application/msword"/>
<title>
</title>
</head>
<body>
<div class="header"/>
<p class="body">Test DOC file content</p>
<div class="footer"/>
</body>
</html>
The metadata presented here isn’t guaranteed to appear on every document.
OpenDocument text
- Associated extension: .odt
- fileTypesToMatch: odt
Excel spreadsheet
- Associated extensions: .xls, .xlsx
- fileTypesToMatch: xls
For example, in this .xls file, Tika exposes the following HTML, which the crawler then passes to its recordExtractor.

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser"/>
<meta name="Content-Type" content="application/vnd.ms-excel"/>
<title>
</title>
</head>
<body>
<div class="page">
<h1>Feuille 1</h1>
<table>
<tbody>
<tr>
<td>Test XLS file content</td>
</tr>
</tbody>
</table>
<div class="outside">&C&"Helvetica,Regular"&12&K000000&P</div>
</div>
</body>
</html>
The metadata presented here isn’t guaranteed to appear on every document.
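Since spreadsheet content arrives as `<table>` rows, one option is to index each row as its own record. This is only a sketch: it assumes `$` exposes a Cheerio-style API (`toArray`, `find`, `text`), and the function name, the `#row-N` objectID scheme, and the `cells` attribute are hypothetical.

```javascript
// Sketch: one record per non-empty spreadsheet row.
function extractRowRecords({ url, $, fileType }) {
  return $('table tr')
    .toArray()
    // Turn each row into an array of trimmed cell texts.
    .map((row) =>
      $(row)
        .find('td')
        .toArray()
        .map((cell) => $(cell).text().trim())
    )
    // Drop rows where every cell is empty.
    .filter((cells) => cells.some((value) => value !== ''))
    .map((cells, index) => ({
      objectID: `${String(url)}#row-${index}`,
      cells,
      type: fileType,
    }));
}
```

Selecting only `table tr` also leaves out the `div.outside` element, which in the sample holds header and footer formatting codes rather than cell content.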
OpenDocument spreadsheet
- Associated extension: .ods
- fileTypesToMatch: ods
PowerPoint document
- Associated extensions: .ppt, .pptx
- fileTypesToMatch: ppt
For example, in this .ppt file, Tika exposes the following HTML, which the crawler then passes to its recordExtractor.

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser"/>
<meta name="Content-Type" content="application/vnd.ms-powerpoint"/>
<title>
</title>
</head>
<body>
<div class="slideShow">
<div class="slide">
<div class="slide-master-content"/>
<div class="slide-content">
<p>Test PPT file content</p>
<p/>
</div>
</div>
<div class="ocr"/>
</div>
</body>
</html>
The metadata presented here isn’t guaranteed to appear on every document.
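The `div.slide` wrappers in this output make it possible to create one record per slide. The sketch below assumes `$` exposes a Cheerio-style API. The function name, the `#slide-N` objectID scheme, and the attribute names are hypothetical.

```javascript
// Sketch: one record per slide that has text in its .slide-content block.
function extractSlideRecords({ url, $, fileType }) {
  return $('div.slide')
    .toArray()
    .map((slide, index) => ({
      objectID: `${String(url)}#slide-${index + 1}`,
      slide: index + 1, // 1-based position in the presentation
      content: $(slide)
        .find('.slide-content')
        .text()
        .replace(/\s+/g, ' ')
        .trim(),
      type: fileType,
    }))
    // Skip slides without any extracted text.
    .filter((record) => record.content !== '');
}
```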
OpenDocument presentation
- Associated extension: .odp
- fileTypesToMatch: odp
Email documents
- Associated extension: .msg
- fileTypesToMatch: email
The file type email includes all documents related to email.
The Crawler supports the Outlook Mail Message (.msg) format.
For example, Tika converts this email into the following HTML:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2017-06-01T15:24:31Z" />
<meta name="Message:To-Email" content="to@domain.com" />
<meta name="dc:description" content="this is a mail to test msg file" />
<meta name="subject" content="this is a mail to test msg file" />
<meta name="dc:creator" content="from@domain.com" />
<meta name="Message:From-Email" content="from@domain.com" />
<meta name="dcterms:created" content="2017-06-01T15:24:31Z" />
<meta name="Message-To" content="to@domain.com" />
<meta name="dcterms:modified" content="2017-06-01T15:24:31Z" />
<meta name="Last-Modified" content="2017-06-01T15:24:31Z" />
<meta name="Message-Recipient-Address" content="to@domain.com" />
<meta name="Message:Raw-Header:X-Unsent" content="1" />
<meta name="Message:Raw-Header:Subject" content="this is a mail to test msg file" />
<meta name="meta:mapi-message-class" content="NOTE" />
<meta name="Message:To-Display-Name" content="to@domain.com" />
<meta name="Last-Save-Date" content="2017-06-01T15:24:31Z" />
<meta name="Message:Raw-Header:MIME-Version" content="1.0" />
<meta name="meta:save-date" content="2017-06-01T15:24:31Z" />
<meta name="dc:title" content="this is a mail to test msg file" />
<meta name="Message:Raw-Header:Message-ID" content="<c58b1b52f61f4789ba40339c6e993440>" />
<meta name="modified" content="2017-06-01T15:24:31Z" />
<meta name="Content-Type" content="application/vnd.ms-outlook" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser" />
<meta name="creator" content="from@domain.com" />
<meta name="Message:Raw-Header:From" content="from@domain.com" />
<meta name="meta:author" content="from@domain.com" />
<meta name="meta:creation-date" content="2017-06-01T15:24:31Z" />
<meta name="meta:mapi-from-representing-email" content="from@domain.com" />
<meta name="Creation-Date" content="2017-06-01T15:24:31Z" />
<meta name="Message-Cc" content="" />
<meta name="Message-Bcc" content="" />
<meta name="meta:mapi-from-representing-name" content="from@domain.com" />
<meta name="Message:Raw-Header:To" content="to@domain.com" />
<meta name="Message:From-Name" content="from@domain.com" />
<meta name="Author" content="from@domain.com" />
<meta name="Message-From" content="from@domain.com" />
<meta name="Message:To-Name" content="" />
<title>this is a mail to test msg file</title>
</head>
<body>
<h1>this is a mail to test msg file</h1>
<dl>
<dt>From</dt>
<dd>from@domain.com</dd>
<dt>To</dt>
<dd>to@domain.com</dd>
<dt>Recipients</dt>
<dd>to@domain.com</dd>
</dl>
<div class="message-body">
<p>This message was sent using a msg file </p>
</div>
</body>
</html>
The metadata presented here isn’t guaranteed to appear on every document.
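For emails, the sender, recipient, and body can be read from the meta tags and the `.message-body` element shown in this sample. This is a sketch under the assumption that `$` exposes a Cheerio-style API. The function name and the record attribute names are hypothetical, and any missing meta tag yields `null`.

```javascript
// Sketch: turn the transformed email HTML into a single record.
function extractEmailRecord({ url, $, fileType }) {
  // Read a <meta> value, or null when the tag is missing or empty.
  const meta = (name) => $(`meta[name="${name}"]`).attr('content') || null;

  return [
    {
      objectID: String(url),
      subject: $('title').text().trim() || meta('subject'),
      from: meta('Message:From-Email'),
      to: meta('Message:To-Email'),
      date: meta('date'),
      body: $('.message-body').text().replace(/\s+/g, ' ').trim(),
      type: fileType,
    },
  ];
}
```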