Metadata-Version: 2.0
Name: Protego
Version: 0.1.16
Summary: Pure-Python robots.txt parser with support for modern conventions
Home-page: UNKNOWN
Author: Anubhav Patel
Author-email: anubhavp28@gmail.com
License: BSD
Keywords: robots.txt,parser,robots,rep
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*
Description-Content-Type: text/markdown
Requires-Dist: six

# Protego

![build-badge](https://api.travis-ci.com/scrapy/protego.svg?branch=master)
[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)

## Overview

Protego is a pure-Python `robots.txt` parser with support for modern conventions.

## Requirements

* Python 2.7 or Python 3.5+
* Works on Linux, Windows, Mac OSX, BSD

## Install

To install Protego, simply use pip:

```
pip install protego
```

## Usage

```python
>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m                 # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'
```

Using Protego with [Requests](https://3.python-requests.org/):

```python
>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']
```

## Documentation

Class `protego.Protego`:

### Properties

* `sitemaps` {`list_iterator`} A list of sitemaps specified in `robots.txt`.
* `preferred_host` {string} Preferred host specified in `robots.txt`.

### Methods

* `parse(robotstxt_body)` Parse `robots.txt` and return a new instance of `protego.Protego`.
* `can_fetch(url, user_agent)` Return True if the user agent can fetch the URL, otherwise return False.
* `crawl_delay(user_agent)` Return the crawl delay specified for the user agent as a float. If nothing is specified, return None.
* `request_rate(user_agent)` Return the request rate specified for the user agent as a named tuple `RequestRate(requests, seconds, start_time, end_time)`. If nothing is specified, return None.
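
Both `crawl_delay` and `request_rate` fall back to `None` when the corresponding directives are absent. The following minimal sketch (using a hypothetical `robots.txt` body and a hypothetical `"mybot"` agent name) illustrates that behaviour alongside `can_fetch`:

```python
# Minimal sketch: Crawl-delay and Request-rate are not specified in this
# robots.txt, so the accessors return None. The robots.txt body and the
# "mybot" agent name are illustrative placeholders.
from protego import Protego

robotstxt = """
User-agent: *
Disallow: /admin
"""

rp = Protego.parse(robotstxt)

print(rp.can_fetch("https://example.com/admin", "mybot"))  # False (path is disallowed)
print(rp.can_fetch("https://example.com/blog", "mybot"))   # True (no matching rule)
print(rp.crawl_delay("mybot"))                             # None (no Crawl-delay directive)
print(rp.request_rate("mybot"))                            # None (no Request-rate directive)
```

Checking for `None` before throttling keeps a crawler from failing on sites whose `robots.txt` omits these directives.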