# Protego

![build-badge](https://api.travis-ci.com/scrapy/protego.svg?branch=master)
[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)

## Overview

Protego is a pure-Python `robots.txt` parser with support for modern conventions.

## Requirements

* Python 2.7 or Python 3.5+
* Works on Linux, Windows, Mac OSX, BSD

## Install

To install Protego, simply use pip:

```
pip install protego
```
## Usage

```python
>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'
```
Using Protego with [Requests](https://3.python-requests.org/):

```python
>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']
```
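The pieces above can also be combined into a small polite-fetching loop. The sketch below is illustrative only: the site, URLs, and user agent string are hypothetical, the fallback delay of 1 second is an assumption, and `Request-rate` handling is omitted for brevity.

```python
import time

import requests
from protego import Protego

USER_AGENT = "mybot"  # hypothetical user agent string

# Fetch and parse the target site's robots.txt once (hypothetical site).
rp = Protego.parse(requests.get("https://example.com/robots.txt").text)

# Honour a declared crawl delay, or fall back to an assumed default of 1 second.
delay = rp.crawl_delay(USER_AGENT) or 1.0

for url in ["https://example.com/about", "https://example.com/account/contact"]:
    if not rp.can_fetch(url, USER_AGENT):
        continue  # robots.txt disallows this URL for our agent; skip it
    response = requests.get(url, headers={"User-Agent": USER_AGENT})
    time.sleep(delay)  # wait between requests to stay polite
```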
## Documentation

Class `protego.Protego`:

### Properties

* `sitemaps` {`list_iterator`} An iterator over the sitemap URLs specified in `robots.txt`.
* `preferred_host` {string} Preferred host specified in `robots.txt`.

### Methods

* `parse(robotstxt_body)` Parse `robots.txt` and return a new instance of `protego.Protego`.
* `can_fetch(url, user_agent)` Return `True` if the user agent can fetch the URL, otherwise return `False`.
* `crawl_delay(user_agent)` Return the crawl delay specified for the user agent as a float. If nothing is specified, return `None`.
* `request_rate(user_agent)` Return the request rate specified for the user agent as a named tuple `RequestRate(requests, seconds, start_time, end_time)`. If nothing is specified, return `None`.
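As noted above, `crawl_delay` and `request_rate` return `None` when the corresponding directive is absent, so callers should be prepared for that. A minimal sketch, with a made-up `robots.txt` body and URLs:

```python
from protego import Protego

# A robots.txt with only a Disallow rule: no Crawl-delay, no Request-rate.
rp = Protego.parse("User-agent: *\nDisallow: /private\n")

rp.can_fetch("https://example.com/private/data", "mybot")  # False
rp.can_fetch("https://example.com/public", "mybot")        # True
rp.crawl_delay("mybot")                                     # None
rp.request_rate("mybot")                                    # None
```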