Using uri_spider to parse file metadata

The uri_spider task, when given a Uri entity such as http://www.acme.com, will spider a site to a specified depth (max_depth), up to a specified maximum number of pages (limit) and, if configured, restricted to a specified URL pattern (spider_whitelist). When configured (and by default) it will extract DnsRecord, PhoneNumber and EmailAddress entities from the content of each page. All spidered Uris can be created as entities using the extract_uris option.
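To make the interplay of the three limits concrete, here is a minimal Python sketch of a bounded breadth-first crawl. It is not Intrigue Core's implementation: the spider function, the fetch_links callable and the in-memory SITE graph are all hypothetical, and the exact semantics of max_depth and spider_whitelist in the real task may differ from what is assumed here.

```python
import re
from collections import deque


def spider(start_uri, fetch_links, max_depth=2, limit=100, spider_whitelist=None):
    """Breadth-first crawl bounded by depth, page count and an optional URL pattern.

    fetch_links is a hypothetical callable mapping a uri to the uris it links to.
    Assumption: the start page is depth 0, and links are followed only while
    the current page's depth is below max_depth.
    """
    pattern = re.compile(spider_whitelist) if spider_whitelist else None
    visited = []
    seen = {start_uri}
    queue = deque([(start_uri, 0)])
    while queue and len(visited) < limit:
        uri, depth = queue.popleft()
        visited.append(uri)
        if depth >= max_depth:
            continue  # depth budget used up: record the page, follow nothing
        for link in fetch_links(uri):
            if link in seen:
                continue  # already queued or visited
            if pattern and not pattern.search(link):
                continue  # outside the whitelist pattern
            seen.add(link)
            queue.append((link, depth + 1))
    return visited


# In-memory link graph standing in for real HTTP requests (illustrative only).
SITE = {
    "http://www.acme.com": [
        "http://www.acme.com/about",
        "http://www.acme.com/files/report.pdf",
        "http://other.example/page",
    ],
    "http://www.acme.com/about": ["http://www.acme.com/team"],
    "http://www.acme.com/team": ["http://www.acme.com/deep"],
}


def fetch_links(uri):
    return SITE.get(uri, [])


pages = spider(
    "http://www.acme.com",
    fetch_links,
    max_depth=1,
    limit=10,
    spider_whitelist=r"acme\.com",
)
```

With max_depth=1 and the acme\.com pattern, the sketch visits the start page and the two acme.com pages it links to, and skips both the off-site link and anything deeper.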

Further, the spider will identify any files of the types listed below, and parse their content and metadata for the same entity types. Because this file parsing uses the excellent Apache Tika under the hood, the range of supported file formats is huge: over 300 formats are supported, including common ones like doc, docx and pdf as well as more exotic types like application/ogg and many video formats. To turn this on, enable the parse_file_metadata option.
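As a rough illustration of what "parse their content and metadata for the same types" means, the Python sketch below pulls email addresses out of an already-parsed file. It is not Intrigue Core's code. The extract_emails function and the sample values are hypothetical, and the input is assumed to be a dict with content and metadata keys, roughly the shape a Tika client hands back. Only EmailAddress-style matches are shown, for brevity.

```python
import re

# Deliberately simple pattern; real-world email matching is messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(parsed):
    """Collect unique email addresses from a parsed file's text and metadata.

    parsed is assumed to look like {"content": str or None, "metadata": dict},
    where metadata values may be strings or lists of strings.
    """
    chunks = [parsed.get("content") or ""]
    for value in (parsed.get("metadata") or {}).values():
        if isinstance(value, (list, tuple)):
            chunks.extend(str(v) for v in value)
        else:
            chunks.append(str(value))
    # Join with newlines so a match can never span two separate fields.
    text = "\n".join(chunks)
    return sorted(set(EMAIL_RE.findall(text)))


# Hypothetical parser output for a PDF found during a spider run.
parsed = {
    "content": "Contact sales@acme.com for pricing.",
    "metadata": {
        "Author": "Jane Doe <jane.doe@acme.com>",
        "Content-Type": "application/pdf",
        "dc:creator": ["jane.doe@acme.com"],
    },
}

emails = extract_emails(parsed)
```

Scanning the metadata as well as the body is what surfaces jane.doe@acme.com here: the address appears only in the author fields, never in the document text.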

Below, see a screenshot of the task’s configuration:

uri_spider task configuration

Note that you can also take advantage of Intrigue Core’s file parsing capabilities on a Uri-by-Uri basis by pointing the uri_extract_metadata task at a specific Uri with a file you’d like parsed, such as https://acme.com/file.pdf.

 
