HTML::Element - Class for objects that represent HTML elements
use HTML::Element;
$a = HTML::Element->new('a', href => 'http://www.perl.com/');
$a->push_content("The Perl Homepage");
$tag = $a->tag;
print "$tag starts out as:", $a->starttag, "\n";
print "$tag ends as:", $a->endtag, "\n";
print "$tag\'s href attribute is: ", $a->attr('href'), "\n";
$links_r = $a->extract_links();
print "Hey, I found ", scalar(@$links_r), " links.\n";
print "And that, as HTML, is: ", $a->as_HTML, "\n";
$a = $a->delete;
Objects of the HTML::Element class can be used to represent elements of HTML. These objects have attributes, notably attributes that designate the element's parent and content. The content is an array of text segments and other HTML::Element objects. A tree with HTML::Element objects as nodes can represent the syntax tree for an HTML document.
It may occur to you to wonder what exactly a ``tree'' is, and how it's represented in memory. Consider this HTML document:
<html lang='en-US'>
<head>
<title>Stuff</title>
<meta name='author' content='Jojo'>
</head>
<body>
<h1>I like potatoes!</h1>
</body>
</html>
Building a syntax tree out of it makes a tree-structure in memory that could be diagrammed as:
              html (lang='en-US')
               / \
             /     \
           /         \
         head        body
        /\              \
      /    \             \
    /        \            \
  title     meta           h1
   |       (name='author',  |
 "Stuff"   content='Jojo') "I like potatoes"
This is the traditional way to diagram a tree, with the ``root'' at the top, and it's this kind of diagram that people have in mind when they say, for example, that ``the meta element is under the head element instead of under the body element''. (The same is also said with ``inside'' instead of ``under'' -- the use of ``inside'' makes more sense when you're looking at the HTML source.)
Another way to represent the above tree is with indenting:
html (attributes: lang='en-US')
  head
    title
      "Stuff"
    meta (attributes: name='author' content='Jojo')
  body
    h1
      "I like potatoes"
Incidentally, diagramming with indenting works much better for very large trees, and is easier for a program to generate. The $tree->dump method uses indentation just that way.
However you diagram the tree, it's stored the same in memory -- it's a network of objects, each of which has attributes like so:
element #1:  _tag: 'html'
             _parent: none
             _content: [element #2, element #5]
             lang: 'en-US'

element #2:  _tag: 'head'
             _parent: element #1
             _content: [element #3, element #4]

element #3:  _tag: 'title'
             _parent: element #2
             _content: [text segment "Stuff"]

element #4:  _tag: 'meta'
             _parent: element #2
             _content: none
             name: 'author'
             content: 'Jojo'

element #5:  _tag: 'body'
             _parent: element #1
             _content: [element #6]

element #6:  _tag: 'h1'
             _parent: element #5
             _content: [text segment "I like potatoes"]
The ``treeness'' of the tree-structure that these elements comprise is not an aspect of any particular object, but is emergent from the relatedness attributes (_parent and _content) of these element-objects and from how you use them to get from element to element.
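As a rough sketch of that network of objects, the example tree can be assembled by hand with the constructor and push_content, and the parent and content links can then be followed in either direction. The variable names here are only illustrative, and the sketch assumes HTML::Element is installed.

```perl
use strict;
use warnings;
use HTML::Element;

# Build the example tree bottom-up; variable names are illustrative only.
my $title = HTML::Element->new('title');
$title->push_content('Stuff');
my $meta = HTML::Element->new('meta', name => 'author', content => 'Jojo');
my $head = HTML::Element->new('head');
$head->push_content($title, $meta);

my $h1 = HTML::Element->new('h1');
$h1->push_content('I like potatoes!');
my $body = HTML::Element->new('body');
$body->push_content($h1);

my $html = HTML::Element->new('html', lang => 'en-US');
$html->push_content($head, $body);

# Follow the links downward (content) and upward (parent).
my @child_tags = map { $_->tag } @{ $html->content };   # tags of the root's children
my $up_from_h1 = $h1->parent->parent->tag;              # h1 -> body -> html
my $title_text = $title->content->[0];                  # a plain string, not an object

print "children of html: @child_tags\n";
print "grandparent of h1: $up_from_h1\n";
print "title text: $title_text\n";
```

Nothing in this sketch stores "the tree" as such; only the individual elements and their links exist.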
While you could access the content of a tree by writing code that says ``access the 'src' attribute of the root's first child's seventh child's third child'', you're more likely to have to scan the contents of a tree, looking for whatever nodes, or kinds of nodes, you want to do something with. The most straightforward way to look over a tree is to ``traverse'' it; an HTML::Element method ($h->traverse) is provided for this purpose; and several other HTML::Element methods are based on it.
(For everything you ever wanted to know about trees, and then some, see Donald Knuth's The Art of Computer Programming, Volume 1.)
The constructor, HTML::Element->new('tag', attrname => 'value', ...), returns a new HTML::Element object. The tag name is a required argument; it will be forced to lowercase. Optionally, you can specify other initial attributes at object creation time.
$h->tag() returns (optionally sets, as $h->tag('tagname')) the tag name (also known as the generic identifier) for the element $h. The tag name is always converted to lowercase.
$h->content() returns the content of this element -- i.e., what is inside/under this element. The return value is either undef (which you should understand to mean no content), or a reference to the array of content items, each of which is either a text segment or an HTML::Element object.
$h->parent() returns (optionally sets, as $h->parent($new_parent)) the parent for this element. (If you're thinking about using this to attach or detach nodes, instead consider $new_parent->push_content($h), $new_parent->unshift_content($h), or $h->detach.)
$h->implicit() returns (optionally sets) the implicit attribute. This attribute is used to indicate that the element was not originally present in the source, but was added to the parse tree (by HTML::TreeBuilder, for example) in order to conform to the rules of HTML structure.
$h->pos() returns (and optionally sets, as $h->pos($element)) the ``current position'' pointer of $h. This ``pos'' attribute is a pointer used during some parsing operations, whose value is whatever HTML::Element element at or under $h is currently ``open'', where $h->insert_element(NEW) will actually insert a new element.
(This has nothing to do with the Perl function called ``pos'', for controlling where regular expression matching starts.)
If you set $h->pos($element), be sure that $element is either $h, or an element under $h.
If you've been modifying the tree under $h and are no longer sure $h->pos is valid, you can enforce validity with:
$h->pos(undef) unless $h->pos->is_inside($h);
$h->attr('attr') returns the value of the given attribute of $h; $h->attr('attr', 'value') sets it. The attribute name (but not the value, if provided) is forced to lowercase. If setting a new value, the old value of that attribute is returned.
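The basic accessors can be tried on a single element. This is a minimal sketch, assuming HTML::Element is installed; the variable names and the second URL are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Element;

my $link = HTML::Element->new('a', href => 'http://www.perl.com/');

my $tag      = $link->tag;                                    # the tag name
my $old_href = $link->attr('href', 'http://www.perl.org/');   # setting returns the old value
my $new_href = $link->attr('href');                           # reading returns the current value
my $missing  = $link->attr('title');                          # an attribute that was never set

print "tag: $tag\n";
print "href was: $old_href\n";
print "href is now: $new_href\n";
print "title is ", (defined $missing ? "'$missing'" : 'not set'), "\n";
```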
While you theoretically could modify a tree by directly manipulating objects' parent and content attributes, it's much simpler (and less error-prone) to use these methods:
$h->insert_element($element) inserts a new element under the element at $h->pos(). It then updates $h->pos() to point to the inserted element, unless $element is a prototypically empty element like ``br'', ``hr'', ``img'', etc. The new $h->pos() is returned.
$h->push_content($element_or_text, ...) adds the specified items to the end of the content list of the element $h. The items of content to be added should each be either a text segment (a string) or an HTML::Element object.
$h->unshift_content($element_or_text, ...) adds the specified items to the beginning of the content list of the element $h. The items of content to be added should each be either a text segment (a string) or an HTML::Element object.
$h->splice_content($offset, $length, $element_or_text, ...) removes the elements designated by $offset and $length from the content-list of element $h, and replaces them with the elements of the following list, if any. Returns the elements removed from the array. If $offset is negative, then it starts that far from the end of the array. If $length and the following list are omitted, removes everything from $offset onward.
The items of content to be added should each be either a text segment (a string) or an HTML::Element object, and should not already be children of $h.
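Here is a small sketch of splicing a content list, assuming HTML::Element is installed. The list items and variable names are invented for the example, and it deliberately does not rely on the method's return value; it only inspects the tree afterwards.

```perl
use strict;
use warnings;
use HTML::Element;

# A list with four items; the item texts are made up for the example.
my $ul    = HTML::Element->new('ul');
my @items = map {
    my $li = HTML::Element->new('li');
    $li->push_content($_);
    $li;
} qw(one two three four);
$ul->push_content(@items);

# Replace the two middle items (offset 1, length 2) with a single new one.
my $new = HTML::Element->new('li');
$new->push_content('middle');
$ul->splice_content(1, 2, $new);

my @texts    = map { $_->as_text } @{ $ul->content };
my $orphaned = !defined $items[1]->parent && !defined $items[2]->parent;

print "items now: @texts\n";
print "removed items are detached: ", ($orphaned ? 'yes' : 'no'), "\n";
```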
$h->detach() unlinks $h from its parent, by setting its 'parent' attribute to undef, and by removing it from the content list of its parent (if it had one). The return value is the parent that was detached from (or undef, if $h had no parent to start with). Note that neither $h nor its parent is explicitly destroyed.
$h->replace_with_content() replaces $h in its parent's content list with its own content. The element $h (which by then has no parent or content of its own) is returned. This causes a fatal error if $h has no parent. Also, note that this does not destroy $h -- use $h->replace_with_content->delete if you need that.
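The difference between detaching an element and replacing it with its content can be sketched as follows, assuming HTML::Element is installed. The elements and text are invented for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $div = HTML::Element->new('div');
my $p   = HTML::Element->new('p');
my $em  = HTML::Element->new('em');
$em->push_content('very');
$p->push_content('This is ', $em, ' important.');
$div->push_content($p);

# Unwrap the em: its text takes its place in the p's content list.
$em->replace_with_content;
my $text_after_unwrap = $p->as_text;
my $elements_left     = grep { ref $_ } @{ $p->content };   # count of child elements
my $em_is_orphan      = !defined $em->parent;

# Detach the p: it leaves the div but keeps its own content.
my $old_parent   = $p->detach;
my $div_is_empty = $div->is_empty;

print "p text after unwrapping em: $text_after_unwrap\n";
print "p was detached from: ", $old_parent->tag, "\n";
```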
$h->delete_content() clears the content of $h, calling $i->delete for each content element. Returns $h.
$h->delete() removes this element from its parent (if it has one) and explicitly destroys the element and all its descendants. The return value is undef.
Perl uses garbage collection based on reference counting; when no references to a data structure exist, it's implicitly destroyed -- i.e., when no value anywhere points to a given object anymore, Perl knows it can free up the memory that the now-unused object occupies.
But this fails with HTML::Element trees, because a parent element always holds references to its children, and its child elements hold references to the parent, so no element ever looks like it's not in use. So, to destroy those elements, you need to call $h->delete on the parent.
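The effect of such parent/child cycles on reference counting can be observed with a small stand-alone sketch. Demo::Node is a hypothetical stand-in class written for this illustration, not part of HTML::Element; it only mimics the two-way links described above and counts how many nodes Perl actually frees.

```perl
use strict;
use warnings;

my $freed = 0;    # incremented each time a node is actually destroyed

{
    # Demo::Node is a hypothetical stand-in, not part of HTML::Element.
    package Demo::Node;

    sub new {
        my ($class, $tag) = @_;
        return bless { tag => $tag, parent => undef, content => [] }, $class;
    }

    sub push_content {
        my ($self, $child) = @_;
        push @{ $self->{content} }, $child;    # parent -> child reference
        $child->{parent} = $self;              # child -> parent reference (the cycle)
        return $self;
    }

    # Break every link by hand so the reference counts can reach zero.
    sub delete {
        my ($self) = @_;
        $_->delete for @{ $self->{content} };
        %$self = ();
        return;
    }

    sub DESTROY { $freed++ }
}

my ($without_delete, $with_delete);

{
    my $body = Demo::Node->new('body');
    $body->push_content(Demo::Node->new('h1'));
}
$without_delete = $freed;    # the cycle keeps both nodes alive

{
    my $body = Demo::Node->new('body');
    $body->push_content(Demo::Node->new('h1'));
    $body->delete;
}
$with_delete = $freed - $without_delete;    # nodes freed from the second tree

print "freed without delete: $without_delete\n";
print "freed with delete:    $with_delete\n";
```

The first tree goes out of scope without any of its nodes being freed, while the second tree is freed because its links were broken explicitly first.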
$h->dump() prints the element and all its children to STDOUT, in a format useful only for debugging. The structure of the document is shown by indentation (no end tags).
$h->as_HTML() or $h->as_HTML($entities) returns a string representing in HTML the element and its children. The optional argument $entities specifies a string of the characters to encode as entities. For compatibility with previous versions, specify '<>&' here. If omitted or undef, all unsafe characters are encoded as HTML entities. See HTML::Entities for details.
$h->as_text() returns a string that represents only the text parts of the element's descendants. Entities are decoded to corresponding ISO-8859-1 (Latin-1) characters. See HTML::Entities for more information.
$h->starttag() or $h->starttag($entities) returns a string representing the complete start tag for the element, i.e., leading ``<'', tag name, attributes, and trailing ``>''. Attribute values that don't consist entirely of digits are surrounded with double-quotes, and appropriate characters are encoded. If $entities is omitted or undef, all unsafe characters are encoded as HTML entities. See HTML::Entities for details. If you specify some value for $entities, remember to include the double-quote character in it. (Previous versions of this module would basically behave as if '&">' were specified for $entities.)
$h->endtag() returns a string representing the complete end tag for this element, i.e., ``</'', tag name, and ``>''.
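A short sketch of the output methods on one element, assuming HTML::Element is installed. The link text is made up so that it contains a character that needs encoding; the exact whitespace of the as_HTML result is not assumed here.

```perl
use strict;
use warnings;
use HTML::Element;

my $link = HTML::Element->new('a', href => 'http://www.perl.com/');
$link->push_content('Perl & friends');

my $start = $link->starttag;   # start tag with its attributes
my $end   = $link->endtag;     # matching end tag
my $text  = $link->as_text;    # text only, no markup or entities
my $html  = $link->as_HTML;    # full markup, unsafe characters encoded

print "start tag: $start\n";
print "end tag:   $end\n";
print "text:      $text\n";
print "html:      $html\n";
```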
These methods all involve some structural aspect of the tree; either they report some aspect of the tree's structure, or they involve traversal down the tree, or walking up the tree.
$h->is_inside('tag', $element, ...) returns true if the $h element is, or is contained anywhere inside, an element that is any of the ones listed, or whose tag name is any of the tag names listed.
$h->is_empty() returns true if $h has no content, i.e., has no elements or text segments under it. In other words, this returns true if $h is a leaf node, AKA a terminal node. Do not confuse this sense of ``empty'' with another sense that it can have in SGML/HTML/XML terminology, which means that the element in question is of the type (like HTML's ``hr'', ``br'', ``img'', etc.) that can't have any content.
That is, a particular ``p'' element may happen to have no content, so $that_p_element->is_empty will be true -- even though the prototypical ``p'' element isn't ``empty'' (in the way that the prototypical ``hr'' element is).
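Both tests can be sketched on a small hand-built tree, assuming HTML::Element is installed. The structure and names are invented for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $body = HTML::Element->new('body');
my $p    = HTML::Element->new('p');      # a p that happens to have no content
my $div  = HTML::Element->new('div');
my $em   = HTML::Element->new('em');
$em->push_content('hi');
$div->push_content($em);
$body->push_content($p, $div);

my $em_in_body = $em->is_inside('body');   # test by tag name
my $em_in_div  = $em->is_inside($div);     # test by element object
my $em_in_p    = $em->is_inside($p);       # the p is a sibling branch
my $p_empty    = $p->is_empty;             # no content at the moment
my $div_empty  = $div->is_empty;           # contains the em

printf "em inside body: %s, inside div: %s, inside p: %s\n",
    map { $_ ? 'yes' : 'no' } $em_in_body, $em_in_div, $em_in_p;
printf "p empty: %s, div empty: %s\n",
    map { $_ ? 'yes' : 'no' } $p_empty, $div_empty;
```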
$h->pindex() returns the index of the element in its parent's contents array, such that $h would equal $h->parent->content->[$h->pindex], assuming $h isn't root. If the element $h is root, then $h->pindex returns undef.
$h->address() returns a string representing the location of this node in the tree. The address consists of numbers joined by a '.', starting with '0', and followed by the pindexes of the nodes in the tree that are ancestors of $h, starting from the top.
So if the way to get to a node starting at the root is to go to child 2 of the root, then child 10 of that, and then child 0 of that, and then you're there -- then that node's address is ``0.2.10.0''.
As a bit of a special case, the address of the root is simply ``0''.
I foresee this being used mainly for debugging.
$h->address($address) returns the node (whether element or text-segment) at the given address in the tree that $h is a part of. (That is, the address is resolved starting from $h->root.)
If there is no node at the given address, this returns undef.
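As an illustration of how pindex and addresses relate, here is a minimal sketch on a hand-built tree, assuming HTML::Element is installed. The tree shape is invented for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $root = HTML::Element->new('html');
my $head = HTML::Element->new('head');
my $body = HTML::Element->new('body');
my $h1   = HTML::Element->new('h1');
my $p    = HTML::Element->new('p');
$p->push_content('Hooboy.');
$body->push_content($h1, $p);
$root->push_content($head, $body);

my $p_index      = $p->pindex;                  # position among body's children
my $p_address    = $p->address;                 # root, then body, then p
my $root_address = $root->address;              # the special case for the root
my $found        = $h1->address($p_address);    # resolved from the root, not from h1
my $text         = $root->address('0.1.1.0');   # an address can name a text segment
my $nothing      = $root->address('0.5');       # the root has no sixth child

print "p is child number $p_index of its parent\n";
print "address of p: $p_address\n";
print "address of root: $root_address\n";
print "node at $p_address: ", $found->tag, "\n";
```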
$h->depth() returns a number expressing $h's depth within its tree, i.e., how many steps away it is from the root. If $h has no parent (i.e., is root), its depth is 0.
$h->root() returns the element that's the top of $h's tree. If $h is root, this just returns $h. (If you want to test whether $h is the root, instead of asking what its root is, just test not($h->parent).)
$h->lineage() returns the list of $h's ancestors, starting with its parent, and then that parent's parent, and so on, up to the root. If $h is root, this returns an empty list.
If you simply want a count of the number of elements in $h's lineage, use $h->depth.
$h->lineage_tag_names() returns the list of the tag names of $h's ancestors, starting with its parent, and that parent's parent, and so on, up to the root. If $h is root, this returns an empty list. Example output:
('em', 'td', 'tr', 'table', 'body', 'html')
In list context, $h->descendants() returns the list of all $h's descendant elements, listed in pre-order (i.e., an element appears before its content-elements). Text segments do not appear in the list. In scalar context, returns a count of all such elements.
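The upward-looking methods (root, depth, lineage, lineage_tag_names) and the downward-looking descendants can be compared on one small chain of elements. This is a sketch assuming HTML::Element is installed; the tree is invented for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my ($html, $body, $p, $em) = map { HTML::Element->new($_) } qw(html body p em);
$em->push_content('text');
$p->push_content('some ', $em);
$body->push_content($p);
$html->push_content($body);

my @ancestor_tags = map { $_->tag } $em->lineage;   # nearest ancestor first
my @lineage_names = $em->lineage_tag_names;         # same order, tag names only
my $depth         = $em->depth;                     # steps up to the root
my $root_tag      = $em->root->tag;

my @descendant_tags  = map { $_->tag } $html->descendants;   # pre-order, elements only
my $descendant_count = $html->descendants;                   # scalar context: a count

print "lineage of em: @lineage_names\n";
print "depth of em: $depth, root: $root_tag\n";
print "descendants of html: @descendant_tags ($descendant_count in all)\n";
```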
$h->traverse(\&callback, $ignore_text) traverses the element and all of its children. For each node visited, the callback routine is called with these arguments:
$_[0] : the node (element or text segment),
$_[1] : a startflag, and
$_[2] : the depth
If the $ignore_text parameter is given and true, then the callback will not be called for text content.
The startflag is 1 when we enter a node (i.e., in pre-order calls) and 0 when we leave the node (in post-order calls). Note, however, that post-order calls don't happen for nodes that are text segments or elements that are prototypically empty (like ``br'', ``hr'', etc.).
If the returned value is false from the pre-order call to the callback, then the children will not be traversed, nor will the callback be called in post-order for that node.
If $ignore_text is given and false (so we do visit text nodes, instead of ignoring them), then when text nodes are visited, we will also pass two extra arguments to the callback:
$_[3] : the element that's the parent of this text node
$_[4] : the index of this text node in its parent's content list
The source code for HTML::Element and HTML::TreeBuilder contain several examples of the use of the ``traverse'' method.
(Note: you should not change the structure of a tree while you are traversing it.)
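A minimal traversal sketch, assuming HTML::Element is installed: the callback records each element on the pre-order visit and always returns true so that children are visited too. The tree and variable names are invented for the example, and text nodes are skipped by passing a true $ignore_text.

```perl
use strict;
use warnings;
use HTML::Element;

my $body = HTML::Element->new('body');
my $h1   = HTML::Element->new('h1');
$h1->push_content('Title');
my $p  = HTML::Element->new('p');
my $br = HTML::Element->new('br');
$p->push_content('line one', $br, 'line two');
$body->push_content($h1, $p);

my (@visited, @outline);
$body->traverse(
    sub {
        my ($node, $startflag, $depth) = @_;
        if ($startflag) {    # record only the pre-order visit
            push @visited, $node->tag;
            push @outline, ('  ' x $depth) . $node->tag;
        }
        return 1;            # true: carry on into this node's children
    },
    1,                       # $ignore_text: do not call back for text segments
);

print "$_\n" for @outline;
```

The tree is only read here, never modified, in keeping with the note above.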
In list context, $h->find_by_tag_name('tag', ...) returns a list of elements at or under $h that have any of the specified tag names. In scalar context, returns the first (in pre-order traversal of the tree) such element found, or undef if none.
In list context, $h->find_by_attribute('attribute', 'value') returns a list of elements at or under $h that have the specified attribute, and have the given value for that attribute. In scalar context, returns the first (in pre-order traversal of the tree) such element found, or undef if none.
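The two search methods might be used as in the following sketch, which assumes HTML::Element is installed. The class values and the image name are made up for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $body = HTML::Element->new('body');
my $p1   = HTML::Element->new('p', class => 'intro');
my $p2   = HTML::Element->new('p', class => 'note');
my $img  = HTML::Element->new('img', src => 'potato.png', class => 'note');
$p2->push_content($img);
$body->push_content($p1, $p2);

my @paras = $body->find_by_tag_name('p');               # list context: every match
my $first = $body->find_by_tag_name('p');               # scalar context: first in pre-order
my @notes = $body->find_by_attribute('class', 'note');  # attribute name and exact value
my $none  = $body->find_by_tag_name('table');           # nothing matches

print "paragraphs found: ", scalar(@paras), "\n";
print "first paragraph has class: ", $first->attr('class'), "\n";
print "elements with class 'note': ", join(' ', map { $_->tag } @notes), "\n";
```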
In list context, $h->attr_get_i('attribute') returns a list consisting of the values of the given attribute for $h and for all its ancestors, starting from $h and working its way up. Nodes with no such attribute are skipped. (``attr_get_i'' stands for ``attribute get, with inheritance''.) In scalar context, returns the first such value, or undef if none.
Consider a document consisting of:
<html lang='i-klingon'>
<head><title>Pati Pata</title></head>
<body>
<h1 lang='la'>Stuff</h1>
<p lang='es-MX' align='center'>
Foo bar baz <cite>Quux</cite>.
</p>
<p>Hooboy.</p>
</body>
</html>
If $h is the ``cite'' element, $h->attr_get_i('lang') in list context will return the list ('es-MX', 'i-klingon'). In scalar context, it will return the value 'es-MX'.
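The same lookup can be reproduced on a hand-built copy of the relevant part of that document. This is only a sketch, assuming HTML::Element is installed; the head, title, and second paragraph are left out because they do not affect the result.

```perl
use strict;
use warnings;
use HTML::Element;

my $html = HTML::Element->new('html', lang => 'i-klingon');
my $body = HTML::Element->new('body');
my $h1   = HTML::Element->new('h1', lang => 'la');
$h1->push_content('Stuff');
my $p    = HTML::Element->new('p', lang => 'es-MX', align => 'center');
my $cite = HTML::Element->new('cite');
$cite->push_content('Quux');
$p->push_content('Foo bar baz ', $cite, '.');
$body->push_content($h1, $p);
$html->push_content($body);

my @langs = $cite->attr_get_i('lang');    # list context: nearest value first
my $lang  = $cite->attr_get_i('lang');    # scalar context: nearest value only
my $align = $cite->attr_get_i('align');   # inherited from the enclosing p

print "lang values seen from cite: @langs\n";
print "effective lang: $lang, effective align: $align\n";
```

The body element has no lang attribute, so it is skipped on the way up.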
$h->extract_links() returns links found by traversing the element and all of its children and looking for attributes (like ``href'' in an ``a'' element, or ``src'' in an ``img'' element) whose values represent links. The return value is a reference to an array. Each element of the array is a reference to an array with two items: the link-value and the element that has the attribute with that link-value. You may or may not end up using the element itself -- for some purposes, you may use only the link value.
You might specify that you want to extract links from just some kinds of elements (instead of the default, which is to extract links from all the kinds of elements known to have attributes whose values represent links). For instance, if you want to extract links from only ``a'' and ``img'' elements, you could code it like this:
for (@{ $e->extract_links('a', 'img') }) {
    my($link, $element) = @$_;
    print
        "Hey, there's a ", $element->tag,
        " that links to $link\n";
}
* If you want to free the memory associated with a tree built of HTML::Element nodes then you will have to delete it explicitly. See the $h->delete method, above.
* There's almost nothing to stop you from making a ``tree'' with cyclicities (loops) in it, which could, for example, make the traverse method go into an infinite loop. So don't make cyclicities! (If all you're doing is parsing HTML files, and looking at the resulting trees, this will never be a problem for you.)
* There's no way to represent comments or processing directives in a tree with HTML::Elements.
See also HTML::AsSubs and HTML::TreeBuilder.
Copyright 1995-1998 Gisle Aas, 1999 Sean M. Burke.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Original author Gisle Aas <gisle@aas.no>; current maintainer Sean M. Burke, <sburke@netadventure.net>