Pythonic Extractor

This extractor differs from the other extractors because it hands more power directly to the user, giving full access to the HTML content of the page.

The extractor is available at invana_bot.extractors.PythonBasedExtractor.

The user writes a Python function in ib_functions.py and specifies the name of that function as extractor_fn in manifest.yml, as shown in the example.

# in manifest.yml
- spider_id: default_spider
  allowed_domains:
    - "github.com"
  extractors:
  - extractor_id: page_detection
    extractor_type: PythonBasedExtractor
    extractor_fn: default_extractor_fn

# in ib_functions.py

def default_extractor_fn(response=None):
    """
    Classify a page by matching patterns in its URL and return the page type.
    """
    url = response.url

    data = {}
    if "/contact" in url:
        data["page_type"] = "contact"
    elif "/blog/" in url:
        data["page_type"] = "blog"
    elif "/about" in url:
        data["page_type"] = "about"
    elif "/service" in url:
        data["page_type"] = "service"
    elif "/product" in url:
        data["page_type"] = "product"
    elif url.strip("/").count("/") == 2:  # only scheme and host remain, e.g. https://github.com; this heuristic can be improved.
        data["page_type"] = "homepage"
    else:
        data["page_type"] = "others"
    return data
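
The example above only looks at the URL. Because the function receives the whole response, it can also read the page body. The following is a minimal sketch rather than part of invana_bot: it assumes the response object exposes `.text` with the HTML source in addition to `.url` (only `.url` is confirmed by the example above). The names `title_extractor_fn` and `_TitleParser` are hypothetical. It also shows one way to improve the homepage check, by parsing the URL instead of counting slashes.

```python
from html.parser import HTMLParser
from types import SimpleNamespace
from urllib.parse import urlparse


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def title_extractor_fn(response=None):
    """
    Hypothetical extractor: returns the page title and a homepage flag.
    Assumes response.url is the page URL and response.text is the HTML.
    """
    parser = _TitleParser()
    parser.feed(response.text)

    # A page is treated as the homepage when the URL has no path component.
    path = urlparse(response.url).path
    return {
        "title": parser.title.strip(),
        "is_homepage": path in ("", "/"),
    }


# Stand-in for a real response, used only to try the function locally.
fake_response = SimpleNamespace(
    url="https://github.com/",
    text="<html><head><title> GitHub </title></head><body></body></html>",
)
result = title_extractor_fn(fake_response)
```

Such a function would be referenced from manifest.yml the same way as default_extractor_fn, by putting its name in extractor_fn. Trying it against a stand-in object first makes it easier to catch mistakes before running a crawl.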