Custom Extractor
The custom extractor lets users build their own extractors from data selectors.
CustomContentExtractor
For a custom extractor, you need to define what data to extract and how to save it in the database. You can give one or multiple data selectors, or add child_selectors to any selector.
NOTE: Only one level of child_selectors is supported.
Usage
spiders:
  - spider_id: blog_list
    extractors:
      - extractor_type: CustomContentExtractor
        extractor_id: blog_list_parser
        data_selectors:
          - selector_id: blogs
            selector: ".post-listing .post-item"
            selector_attribute: element
            multiple: true
            child_selectors:
              - selector_id: url
                selector: ".post-header h2 a"
                selector_type: css
                selector_attribute: href
                multiple: false
              - selector_id: title
                selector: ".post-header h2 a"
                selector_type: css
                selector_attribute: text
                multiple: false
              - selector_id: content
                selector: ".post-content"
                selector_type: css
                selector_attribute: html
                multiple: false
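
For comparison, a flat configuration with a single data selector and no child_selectors might look like the sketch below. It reuses only the keys shown in the example above. The spider_id, extractor_id, selector_id and CSS selector values are hypothetical, and it assumes the keys behave the same way at the top level as they do inside child_selectors.

```yaml
# Hypothetical sketch: one top-level selector, no child_selectors.
spiders:
  - spider_id: blog_detail              # assumed name, for illustration only
    extractors:
      - extractor_type: CustomContentExtractor
        extractor_id: blog_detail_parser  # assumed name
        data_selectors:
          - selector_id: title           # key under which the value is stored
            selector: ".post-header h1"  # assumed CSS selector
            selector_type: css
            selector_attribute: text
            multiple: false              # expect a single match, not a list
```

A nested layout with child_selectors fits pages that repeat the same block many times, such as a post listing. A flat layout like this one is probably enough when each field occurs once per page.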