Standard Extractor

InvanaBot allows users to extract information while it crawls through the webpages. You can specify multiple extractors to a spider, allowing you to organise the information you need into grouped/subdocument data.

All the extractors are available at invana_bot.extractors

ParagraphsExtractor

Here is the configuration.

spiders:
- spider_id: default_spider
  extractors:
  - extractor_type: ParagraphsExtractor
    extractor_id: paragraphs_data

Here is the data extracted. This will return all the paragraphs as list.

{
  "paragraphs_data" :  [
      "Here is the first paragraph",
      "Here is the second paragraph"
    ]

}

TableContentExtractor

Here is the configuration.

spiders:
- spider_id: default_spider
  extractors:
  - extractor_type: TableContentExtractor
    extractor_id: tables_data

Here is the data extracted. This will return all the tables as list.

{  
   "tables_data":[  
      [ // table 1
         {  
            "Code":"IN",
            "Country":"India",
            "Last Checked":"1 second ago"
         }
      ],
      [ // table 2
        {
            "Code": "DEL",
            "State": "New Delhi"
        }
      ]
   ]
} 

MetaTagExtractor

Here is the configuration. This will gather data of og, twitter and fb meta tags, pretty much any meta element.

spiders:
- spider_id: default_spider
  extractors:
  - extractor_type: MetaTagExtractor
    extractor_id: meta_tag_data

Here is the data extracted. This will return all the <meta> tags data as dictionary.

{
  "meta_tag_data":{  
      "meta__viewport":"width=device-width,initial-scale=1,maximum-scale=1",
      "meta__referrer":"origin",
      "meta__description":"Here is the description of the site ",
      "title":"Example site | Homepage"
   }
}

IconsExtractor

Here is the configuration.

spiders:
- spider_id: default_spider
  extractors:
  - extractor_type: IconsExtractor
    extractor_id: icon_data

Here is the data extracted. This will return all the <meta> tags data as dictionary.

{
  "icon_data": {
      "32x32": "https://example.com/2018/10/fav-icon.png?w=32",
      "192x192": "https://example.com/2018/10/fav-icon.png?w=120"
    }
}

JSONLDExtractor

Here is the configuration.

spiders:
- spider_id: default_spider
  extractors:
  - extractor_type: JSONLDExtractor
    extractor_id: json_ld_data

Here is the data extracted. The returned data would be in list, as there can be multiple json+ld descriptions in a single page.


{
  "json_ld_data": [
    {
      "@context": "http://schema.org",
      "@type": "NewsArticle",
      "mainEntityOfPage": {
        "@type": "WebPage",
        "@id": "https://example.com/article/technology/science/space-age/"
      },
      "url": "https://example.com/article/technology/science/space-age/",
      "articleBody": "Lorem ipusum, here is the article body.",
      "articleSection": "technology",
      "keywords": "Chandrayaan-2, Chandrayaan-2 launch, Chandrayaan-2 launch monday, Chandrayaan-2 launch timing, Chandrayaan-2 isro launch date time, Chandrayaan-2 launch july 15, isro moon, isro Chandrayaan-2, Chandrayaan-2 moon",
      "headline": "Chandrayaan-2 to launch India into new space age",
      "description": "The Chandrayaan-2, a moon-lander and rover mission, is designed to go where no spacecraft has gone before.",
      "datePublished": "2019-07-14T09:29:56+05:30",
      "dateModified": "2019-07-14T09:29:56+05:30",
      "publisher": {
        "@type": "Organization",
        "name": "MoonNews Publications",
        "logo": {
          "@type": "ImageObject",
          "url": "https://s2.wp.com/wp-content/themes/vip/example.com-v2/dist/images/ienewlogo3_new.png",
          "width": "600",
          "height": "60"
        }
      },
      "author": {
        "@type": "Person",
        "name": "John Doe",
        "sameAs": "https://example.com/profile/author/john-doe/"
      },
      "image": {
        "@type": "ImageObject",
        "url": "https://images.example.com/2019/07/chandrayaan-2_759-1.jpg",
        "width": "759",
        "height": "422"
      }
    },
    {
      "@context": "http://schema.org",
      "@type": "Person",
      "name": "John Doe",
      "url": "https://example.com/profile/author/john-doe/",
      "worksFor": {
        "@type": "Organization",
        "name": "MoonNews Publications",
        "url": "https://example.com/"
      }
    },
    {
      "@context": "http://schema.org",
      "@type": "WebSite",
      "url": "https://example.com/",
      "potentialAction": {
        "@type": "SearchAction",
        "target": "https://example.com/?s={search_term_string}",
        "query-input": "required name=search_term_string"
      }
    }
  ]
}

PlainHTMLContentExtractor

Here is the configuration.

spiders:
- spider_id: default_spider
  extractors:
  - extractor_type: PlainHTMLContentExtractor
    extractor_id: html_content

Here is the data extracted.

{
  "html_content": "<html></body>Hello World!</body></html>"
}

PageOverviewExtractor

Here is the configuration.

spiders:
- spider_id: default_spider
  extractors:
  - extractor_type: PageOverviewExtractor
    extractor_id: overview

Here is the data extracted.



{  
   "overview":{  
      "title":"Welcome to the site.",
      "description":"Our space mission is designed to go where no spacecraft has gone before.",
      "image":"https://example.com/2019/07/image.jpg?w=759",
      "url":"https://example.com/article/technology/science/space-age/",
      "page_type":"article",
      "keywords":"space launch, chandrayaan-2 launch",
      "domain":"example.com",
      "first_paragraph":"First Paragraph comes here.",
      "shortlink_url":"https://example.com/?page=ok-me",
      "canonical_url":"https://example.com/article/technology/science/space-age/"
   }
}

FeedUrlExtractor

Here is the configuration.

spiders:
- spider_id: default_spider
  extractors:
  - extractor_type: FeedUrlExtractor
    extractor_id: feeds_data

Here is the data extracted.

{
  "feeds_data": {"rss__xml": "https://blog.scrapinghub.com/rss.xml", "rss__atom": null}
}

ImagesExtractor

Here is the configuration.

spiders:
- spider_id: default_spider
  extractors:
  - extractor_type: ImagesExtractor
    extractor_id: images

Here is the data extracted.

{
  "images": [
    "https://example.com/image-1.png",
    "https://example.com/image-2.png",
    "https://example.com/image-3.png"
  ]
}

AllLinksExtractor

Here is the configuration.

spiders:
- spider_id: default_spider
  extractors:
  - extractor_type: AllLinksExtractor
    extractor_id: all_links

Here is the data extracted. This will contains all links

{
  "all_links": [
    "https://example.com/page-1",
    "https://example.com/page-2",
    "https://example.com/page-3",
    "https://facebook.com/page-3",
    "https://twitter.com/page-3"
  ]
}

AllLinksAnalyticsExtractor

Here is the configuration.

spiders:
- spider_id: default_spider
  extractors:
  - extractor_type: AllLinksAnalyticsExtractor
    extractor_id: all_links_analysed

Here is the data extracted. This will contains all links in the page, seperating them into domain specific links.

{  
   "links":[  
      {  
         "domain":"blog.scrapinghub.com",
         "links":[  
            "https://blog.scrapinghub.com",
            "https://blog.scrapinghub.com/web-data-analysis-exposing-nfl-player-salaries-with-python",
            "https://blog.scrapinghub.com/author/attila-tóth",
            "https://blog.scrapinghub.com/web-data-analysis-exposing-nfl-player-salaries-with-python#comments-listing",
            "https://blog.scrapinghub.com/spidermon-scrapy-spider-monitoring",
            "...",
            "https://blog.scrapinghub.com/alternative-financial-data-quality"
         ],
         "links_count":48
      },
      {  
         "domain":"overthecap.com",
         "links":[  
            "https://overthecap.com/",
            "https://overthecap.com/position/quarterback/"
         ],
         "links_count":2
      },
      {  
         "domain":"github.com",
         "links":[  
            "https://github.com/zseta/NFL-Contracts",
            "https://github.com/zseta/NFL-Contracts",
            "https://github.com/zseta/NFL-Contracts"
         ],
         "links_count":3
      },
      {  
         "domain":"scrapingauthority.com",
         "links":[  
            "https://scrapingauthority.com/resources/"
         ],
         "links_count":1
      },
      {  
         "domain":"twitter.com",
         "links":[  
            "https://twitter.com/share",
            "https://twitter.com/ScrapingHub"
         ],
         "links_count":2
      }

   ]

}