Metadata-Version: 2.0
Name: Protego
Version: 0.1.16
Summary: Pure-Python robots.txt parser with support for modern conventions
Home-page: UNKNOWN
Author: Anubhav Patel
Author-email: anubhavp28@gmail.com
License: BSD
Keywords: robots.txt,parser,robots,rep
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*
Description-Content-Type: text/markdown
Requires-Dist: six
# Protego

![build-badge](https://api.travis-ci.com/scrapy/protego.svg?branch=master)
[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)

## Overview

Protego is a pure-Python `robots.txt` parser with support for modern conventions.

## Requirements

* Python 2.7 or Python 3.5+
* Works on Linux, Windows, macOS, BSD

## Install

To install Protego, simply use pip:

```
pip install protego
```
## Usage

```python
>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'
```
Using Protego with [Requests](https://3.python-requests.org/):

```python
>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']
```
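
If you would rather not add the Requests dependency, the same fetch works with only the standard library. The snippet below is a minimal sketch using `urllib.request`; it is illustrative and not part of Protego's API:

```python
>>> from urllib.request import urlopen
>>> from protego import Protego
>>> body = urlopen("https://google.com/robots.txt").read().decode("utf-8")
>>> rp = Protego.parse(body)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
```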
## Documentation

Class `protego.Protego`:

### Properties

* `sitemaps` {`list_iterator`} An iterator over the sitemap URLs listed in `robots.txt`.
* `preferred_host` {string} Preferred host specified in `robots.txt`.

### Methods

* `parse(robotstxt_body)` Parse `robots.txt` and return a new instance of `protego.Protego`.
* `can_fetch(url, user_agent)` Return True if the user agent can fetch the URL, otherwise return False.
* `crawl_delay(user_agent)` Return the crawl delay specified for the user agent as a float. If nothing is specified, return None.
* `request_rate(user_agent)` Return the request rate specified for the user agent as a named tuple `RequestRate(requests, seconds, start_time, end_time)`. If nothing is specified, return None.
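
Putting these methods together: the sketch below shows one way a crawler might honour `can_fetch`, `crawl_delay` and `request_rate` when fetching several pages. The `fetch_politely` helper, the user agent string and the URLs are made up for illustration; they are not part of Protego.

```python
import time

import requests
from protego import Protego


def fetch_politely(base_url, paths, user_agent="mybot"):
    """Yield responses for paths under base_url, honouring its robots.txt."""
    rp = Protego.parse(requests.get(base_url + "/robots.txt").text)

    # Prefer an explicit Crawl-delay; otherwise derive a delay from Request-rate.
    delay = rp.crawl_delay(user_agent)
    rate = rp.request_rate(user_agent)
    if delay is None and rate is not None:
        delay = rate.seconds / rate.requests

    for path in paths:
        url = base_url + path
        if not rp.can_fetch(url, user_agent):
            continue  # robots.txt disallows this URL for our user agent
        yield requests.get(url, headers={"User-Agent": user_agent})
        if delay:
            time.sleep(delay)
```

With the `robots.txt` from the Usage section above, this helper would skip everything outside `/about` and `/account` and sleep 4 seconds between requests.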