- Metadata-Version: 2.0
- Name: html5lib
- Version: 1.0.1
- Summary: HTML parser based on the WHATWG HTML specification
- Home-page: https://github.com/html5lib/html5lib-python
- Author: James Graham
- Author-email: james@hoppipolla.co.uk
- License: MIT License
- Description-Content-Type: UNKNOWN
- Platform: UNKNOWN
- Classifier: Development Status :: 5 - Production/Stable
- Classifier: Intended Audience :: Developers
- Classifier: License :: OSI Approved :: MIT License
- Classifier: Operating System :: OS Independent
- Classifier: Programming Language :: Python
- Classifier: Programming Language :: Python :: 2
- Classifier: Programming Language :: Python :: 2.7
- Classifier: Programming Language :: Python :: 3
- Classifier: Programming Language :: Python :: 3.3
- Classifier: Programming Language :: Python :: 3.4
- Classifier: Programming Language :: Python :: 3.5
- Classifier: Programming Language :: Python :: 3.6
- Classifier: Topic :: Software Development :: Libraries :: Python Modules
- Classifier: Topic :: Text Processing :: Markup :: HTML
- Requires-Dist: six (>=1.9)
- Requires-Dist: webencodings
- Provides-Extra: all
- Requires-Dist: genshi; extra == 'all'
- Requires-Dist: chardet (>=2.2); extra == 'all'
- Provides-Extra: all
- Requires-Dist: datrie; platform_python_implementation == 'CPython' and extra == 'all'
- Requires-Dist: lxml; platform_python_implementation == 'CPython' and extra == 'all'
- Provides-Extra: chardet
- Requires-Dist: chardet (>=2.2); extra == 'chardet'
- Provides-Extra: datrie
- Requires-Dist: datrie; platform_python_implementation == 'CPython' and extra == 'datrie'
- Provides-Extra: genshi
- Requires-Dist: genshi; extra == 'genshi'
- Provides-Extra: lxml
- Requires-Dist: lxml; platform_python_implementation == 'CPython' and extra == 'lxml'
- html5lib
- ========
- .. image:: https://travis-ci.org/html5lib/html5lib-python.png?branch=master
- :target: https://travis-ci.org/html5lib/html5lib-python
- html5lib is a pure-Python library for parsing HTML. It is designed to
- conform to the WHATWG HTML specification, as implemented by all major
- web browsers.
- Usage
- -----
- Simple usage follows this pattern:
- .. code-block:: python
-
-   import html5lib
-   with open("mydocument.html", "rb") as f:
-       document = html5lib.parse(f)
- or:
- .. code-block:: python
-
-   import html5lib
-   document = html5lib.parse("<p>Hello World!")
- By default, the ``document`` will be an ``xml.etree`` element instance.
- Whenever possible, html5lib chooses the accelerated ``ElementTree``
- implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x).
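One practical consequence of getting an ``xml.etree`` element back is that lookups are namespace-aware. The sketch below does not call html5lib. It builds a stand-in tree with the standard library, on the assumption that html5lib's default settings place HTML elements in the XHTML namespace. The names ``XHTML_NS`` and ``ns`` are illustrative only.

```python
import xml.etree.ElementTree as ET

# Stand-in for the tree html5lib.parse("<p>Hello World!") is assumed to return:
# html5lib fills in <html>, <head>, and <body> and, by default, puts HTML
# elements in the XHTML namespace.
XHTML_NS = "http://www.w3.org/1999/xhtml"
document = ET.fromstring(
    '<html xmlns="%s"><head/><body><p>Hello World!</p></body></html>' % XHTML_NS
)

# Tags carry the namespace in Clark notation, so a bare "p" does not match.
unqualified = document.findall(".//p")

# Passing a prefix map lets the query name the namespace explicitly.
ns = {"h": XHTML_NS}
paragraphs = document.findall(".//h:p", ns)
first_text = paragraphs[0].text
```

If unqualified tag names are preferred, the parser can instead be asked not to namespace HTML elements. Check the html5lib documentation for the exact option before relying on it.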
- Two other tree types are supported: ``xml.dom.minidom`` and
- ``lxml.etree``. To use an alternative format, specify the name of
- a treebuilder:
- .. code-block:: python
-
-   import html5lib
-   with open("mydocument.html", "rb") as f:
-       lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
- When using html5lib with ``urllib2`` (Python 2), the charset from the HTTP
- response should be passed to html5lib as follows:
- .. code-block:: python
-
-   from contextlib import closing
-   from urllib2 import urlopen
-   import html5lib
-   with closing(urlopen("http://example.com/")) as f:
-       document = html5lib.parse(f, transport_encoding=f.info().getparam("charset"))
- When using html5lib with ``urllib.request`` (Python 3), the charset from
- the HTTP response should be passed to html5lib as follows:
- .. code-block:: python
-
-   from urllib.request import urlopen
-   import html5lib
-   with urlopen("http://example.com/") as f:
-       document = html5lib.parse(f, transport_encoding=f.info().get_content_charset())
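To see what ``get_content_charset()`` hands to ``transport_encoding``, the following sketch reproduces the header lookup without any network access. It uses a plain ``email.message.Message`` as a stand-in for the object returned by ``f.info()``. The header values are made up for illustration.

```python
from email.message import Message

# Stand-in for the object returned by f.info(): http.client.HTTPMessage is a
# subclass of email.message.Message, so the same method is available on it.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

# get_content_charset() returns the charset parameter lower-cased,
# or None when the header carries no charset.
charset = headers.get_content_charset()

no_charset = Message()
no_charset["Content-Type"] = "text/html"
missing = no_charset.get_content_charset()
```

Here ``charset`` is ``"iso-8859-1"`` and ``missing`` is ``None``. A ``None`` value should simply leave html5lib without a transport-level hint, so it falls back to its other encoding detection steps.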
- To have more control over the parser, create a parser object explicitly.
- For instance, to make the parser raise exceptions on parse errors, use:
- .. code-block:: python
-
-   import html5lib
-   with open("mydocument.html", "rb") as f:
-       parser = html5lib.HTMLParser(strict=True)
-       document = parser.parse(f)
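Strict mode is mostly useful when the caller handles the failure. The sketch below assumes the exception class is importable as ``html5lib.html5parser.ParseError`` and uses a deliberately incomplete document. The variable ``error_message`` is illustrative.

```python
import html5lib
from html5lib.html5parser import ParseError

parser = html5lib.HTMLParser(strict=True)

# The input has no doctype, which the HTML specification treats as a parse
# error; in strict mode the parser raises instead of recovering.
try:
    parser.parse("<p>Hello World!")
    error_message = None
except ParseError as exc:
    error_message = str(exc)
```

With the default ``strict=False``, the same input parses without raising and the parser recovers as a browser would.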
- When you're instantiating parser objects explicitly, pass a treebuilder
- class as the ``tree`` keyword argument to use an alternative document
- format:
- .. code-block:: python
-
-   import html5lib
-   parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
-   minidom_document = parser.parse("<p>Hello World!")
- More documentation is available at https://html5lib.readthedocs.io/.
- Installation
- ------------
- html5lib works on CPython 2.7+, CPython 3.3+ and PyPy. To install it,
- use:
- .. code-block:: bash
-
-   $ pip install html5lib
- Optional Dependencies
- ---------------------
- The following third-party libraries may be used for additional
- functionality:
- - ``datrie`` can be used under CPython to improve parsing performance
- (though in almost all cases the improvement is marginal);
- - ``lxml`` is supported as a tree format (for both building and
- walking) under CPython (but *not* PyPy where it is known to cause
- segfaults);
- - ``genshi`` has a treewalker (but not builder); and
- - ``chardet`` can be used as a fallback when character encoding cannot
- be determined.
- Bugs
- ----
- Please report any bugs on the `issue tracker
- <https://github.com/html5lib/html5lib-python/issues>`_.
- Tests
- -----
- Unit tests require the ``pytest`` and ``mock`` libraries and can be
- run using the ``py.test`` command in the root directory.
- Test data are contained in a separate `html5lib-tests
- <https://github.com/html5lib/html5lib-tests>`_ repository and included
- as a submodule, thus for git checkouts they must be initialized::
-
-   $ git submodule init
-   $ git submodule update
- If you have all compatible Python implementations available on your
- system, you can run tests on all of them using the ``tox`` utility,
- which can be found on PyPI.
- Questions?
- ----------
- There's a mailing list available for support on Google Groups,
- `html5lib-discuss <http://groups.google.com/group/html5lib-discuss>`_,
- though you may get a quicker response asking on IRC in `#whatwg on
- irc.freenode.net <http://wiki.whatwg.org/wiki/IRC>`_.
- Change Log
- ----------
- 1.0.1
- ~~~~~
- Released on December 7, 2017
- Breaking changes:
- * Drop support for Python 2.6. (#330) (Thank you, Hugo, Will Kahn-Greene!)
- * Remove ``utils/spider.py`` (#353) (Thank you, Jon Dufresne!)
- Features:
- * Improve documentation. (#300, #307) (Thank you, Jon Dufresne, Tom Most,
- Will Kahn-Greene!)
- * Add iframe seamless boolean attribute. (Thank you, Ritwik Gupta!)
- * Add itemscope as a boolean attribute. (#194) (Thank you, Jonathan Vanasco!)
- * Support Python 3.6. (#333) (Thank you, Jon Dufresne!)
- * Add CI support for Windows using AppVeyor. (Thank you, John Vandenberg!)
- * Improve testing and CI and add code coverage. (#323, #334) (Thank you, Jon
- Dufresne, John Vandenberg, Geoffrey Sneddon, Will Kahn-Greene!)
- * Semver-compliant version number.
- Bug fixes:
- * Add support for setuptools < 18.5 to support environment markers. (Thank you,
- John Vandenberg!)
- * Add explicit dependency for six >= 1.9. (Thank you, Eric Amorde!)
- * Fix regexes to work with Python 3.7 regex adjustments. (#318, #379) (Thank
- you, Benedikt Morbach, Ville Skyttä, Mark Vasilkov!)
- * Fix alphabeticalattributes filter namespace bug. (#324) (Thank you, Will
- Kahn-Greene!)
- * Include license file in generated wheel package. (#350) (Thank you, Jon
- Dufresne!)
- * Fix annotation-xml typo. (#339) (Thank you, Will Kahn-Greene!)
- * Allow uppercase hex characters in CSS colour check. (#377) (Thank you,
- Komal Dembla, Hugo!)
- 1.0
- ~~~
- Released and unreleased on December 7, 2017. Badly packaged release.
- 0.999999999/1.0b10
- ~~~~~~~~~~~~~~~~~~
- Released on July 15, 2016
- * Fix attribute order going to the tree builder to be document order
- instead of reverse document order(!).
- 0.99999999/1.0b9
- ~~~~~~~~~~~~~~~~
- Released on July 14, 2016
- * **Added ordereddict as a mandatory dependency on Python 2.6.**
- * Added ``lxml``, ``genshi``, ``datrie``, ``charade``, and ``all``
- extras that will do the right thing based on the specific
- interpreter implementation.
- * Now requires the ``mock`` package for the testsuite.
- * Cease supporting DATrie under PyPy.
- * **Remove PullDOM support, as this hasn't ever been properly
- tested, doesn't entirely work, and as far as I can tell is
- completely unused by anyone.**
- * Move testsuite to ``py.test``.
- * **Fix #124: move to webencodings for decoding the input byte stream;
- this makes html5lib compliant with the Encoding Standard, and
- introduces a required dependency on webencodings.**
- * **Cease supporting Python 3.2 (in both CPython and PyPy forms).**
- * **Fix comments containing double-dash with lxml 3.5 and above.**
- * **Use scripting disabled by default (as we don't implement
- scripting).**
- * **Fix #11, avoiding a potential XSS bug in which old browser versions
- let content escape out of attribute values emitted by the serializer.
- The quote_attr_values option on the serializer now takes one of three
- values: "always" (the old True value), "legacy" (the new option, and
- the new default), and "spec" (the old False value, and the old
- default).**
- * **Fix #72 by rewriting the sanitizer to apply only to treewalkers
- (instead of the tokenizer); as such, this will require amending all
- callers of it to use it via the treewalker API.**
- * **Drop support of charade, now that chardet is supported once more.**
- * **Replace the charset keyword argument on parse and related methods
- with a set of keyword arguments: override_encoding, transport_encoding,
- same_origin_parent_encoding, likely_encoding, and default_encoding.**
- * **Move filters._base, treebuilders._base, and treewalkers._base to .base
- to clarify their status as public.**
- * **Get rid of the sanitizer package. Merge sanitizer.sanitize into the
- sanitizer.htmlsanitizer module and move that to sanitizer. This means
- anyone who used sanitizer.sanitize or sanitizer.HTMLSanitizer needs no
- code changes.**
- * **Rename treewalkers.lxmletree to .etree_lxml and
- treewalkers.genshistream to .genshi to have a consistent API.**
- * Move a whole load of stuff (inputstream, ihatexml, trie, tokenizer,
- utils) to be underscore prefixed to clarify their status as private.
- 0.9999999/1.0b8
- ~~~~~~~~~~~~~~~
- Released on September 10, 2015
- * Fix #195: fix the sanitizer to drop broken URLs (it threw an
- exception between 0.9999 and 0.999999).
- 0.999999/1.0b7
- ~~~~~~~~~~~~~~
- Released on July 7, 2015
- * Fix #189: fix the sanitizer to allow relative URLs again (as it did
- prior to 0.9999/1.0b5).
- 0.99999/1.0b6
- ~~~~~~~~~~~~~
- Released on April 30, 2015
- * Fix #188: fix the sanitizer to not throw an exception when sanitizing
- bogus data URLs.
- 0.9999/1.0b5
- ~~~~~~~~~~~~
- Released on April 29, 2015
- * Fix #153: Sanitizer fails to treat some attributes as URLs. Despite how
- this sounds, this has no known security implications. No known version
- of IE (5.5 to current), Firefox (3 to current), Safari (6 to current),
- Chrome (1 to current), or Opera (12 to current) will run any script
- provided in these attributes.
- * Pass error message to the ParseError exception in strict parsing mode.
- * Allow data URIs in the sanitizer, with a whitelist of content-types.
- * Add support for Python implementations that don't support lone
- surrogates (read: Jython). Fixes #2.
- * Remove localization of error messages. This functionality was totally
- unused (and untested that everything was localizable), so we may as
- well follow numerous browsers in not supporting translating technical
- strings.
- * Expose treewalkers.pprint as a public API.
- * Add a documentEncoding property to HTML5Parser, fix #121.
- 0.999
- ~~~~~
- Released on December 23, 2013
- * Fix #127: add work-around for CPython issue #20007: .read(0) on
- http.client.HTTPResponse drops the rest of the content.
- * Fix #115: lxml treewalker can now deal with fragments containing, at
- their root level, text nodes with non-ASCII characters on Python 2.
- 0.99
- ~~~~
- Released on September 10, 2013
- * No library changes from 1.0b3; released as 0.99 as pip has changed
- behaviour from 1.4 to avoid installing pre-release versions per
- PEP 440.
- 1.0b3
- ~~~~~
- Released on July 24, 2013
- * Removed ``RecursiveTreeWalker`` from ``treewalkers._base``. Any
- implementation using it should be moved to
- ``NonRecursiveTreeWalker``, as everything bundled with html5lib has
- for years.
- * Fix #67 so that ``BufferedStream`` correctly returns a bytes
- object, thereby fixing any case where html5lib is passed a
- non-seekable RawIOBase-like object.
- 1.0b2
- ~~~~~
- Released on June 27, 2013
- * Removed reordering of attributes within the serializer. There is now
- an ``alphabetical_attributes`` option which preserves the previous
- behaviour through a new filter. This allows attribute order to be
- preserved through html5lib if the tree builder preserves order.
- * Removed ``dom2sax`` from DOM treebuilders. It has been replaced by
- ``treeadapters.sax.to_sax`` which is generic and supports any
- treewalker; it also resolves all known bugs with ``dom2sax``.
- * Fix treewalker assertions on hitting byte strings on
- Python 2. Prior to 1.0b1, treewalkers coped with mixed
- bytes/unicode data on Python 2; this reintroduces that prior
- behaviour on Python 2. Behaviour is unchanged on Python 3.
- 1.0b1
- ~~~~~
- Released on May 17, 2013
- * Implementation updated to implement the `HTML specification
- <http://www.whatwg.org/specs/web-apps/current-work/>`_ as of 5th May
- 2013 (`SVN <http://svn.whatwg.org/webapps/>`_ revision r7867).
- * Python 3.2+ supported in a single codebase using the ``six`` library.
- * Removed support for Python 2.5 and older.
- * Removed the deprecated Beautiful Soup 3 treebuilder.
- ``beautifulsoup4`` can use ``html5lib`` as a parser instead. Note that
- since it doesn't support namespaces, foreign content like SVG and
- MathML is parsed incorrectly.
- * Removed ``simpletree`` from the package. The default tree builder is
- now ``etree`` (using the ``xml.etree.cElementTree`` implementation if
- available, and ``xml.etree.ElementTree`` otherwise).
- * Removed the ``XHTMLSerializer`` as it never actually guaranteed its
- output was well-formed XML, and hence provided little of use.
- * Removed default DOM treebuilder, so ``html5lib.treebuilders.dom`` is no
- longer supported. ``html5lib.treebuilders.getTreeBuilder("dom")`` will
- return the default DOM treebuilder, which uses ``xml.dom.minidom``.
- * Optional heuristic character encoding detection now based on
- ``charade`` for Python 2.6 - 3.3 compatibility.
- * Optional ``Genshi`` treewalker support fixed.
- * Many bugfixes, including:
-   * #33: null in attribute value breaks XML AttValue;
-   * #4: nested, indirect descendant, <button> causes infinite loop;
-   * `Google Code 215
-     <http://code.google.com/p/html5lib/issues/detail?id=215>`_: Properly
-     detect seekable streams;
-   * `Google Code 206
-     <http://code.google.com/p/html5lib/issues/detail?id=206>`_: add
-     support for <video preload=...>, <audio preload=...>;
-   * `Google Code 205
-     <http://code.google.com/p/html5lib/issues/detail?id=205>`_: add
-     support for <video poster=...>;
-   * `Google Code 202
-     <http://code.google.com/p/html5lib/issues/detail?id=202>`_: Unicode
-     file breaks InputStream.
- * Source code is now mostly PEP 8 compliant.
- * Test harness has been improved and now depends on ``nose``.
- * Documentation updated and moved to https://html5lib.readthedocs.io/.
- 0.95
- ~~~~
- Released on February 11, 2012
- 0.90
- ~~~~
- Released on January 17, 2010
- 0.11.1
- ~~~~~~
- Released on June 12, 2008
- 0.11
- ~~~~
- Released on June 10, 2008
- 0.10
- ~~~~
- Released on October 7, 2007
- 0.9
- ~~~
- Released on March 11, 2007
- 0.2
- ~~~
- Released on January 8, 2007