S=#decoder{offset=O}
...
<<_:O/binary, C, _/binary>> when ?IS_WHITESPACE(C) ->
    tokenize_attributes(B, ?INC_CHAR(S, C), Acc);
{<<"html">>, [{<<"lang">>, <<"ru-RU">>}], [
    {<<"head">>, [], []},
    {<<"body">>, [], [<<"Hello, World!">>]}
]}
I wonder: are all these small binaries actually references to regions of the original large binary (I think this is called a binary slice), or are separate small copies of those memory regions created?

... &lt; is not converted to <. Without that assumption, the parser's author would have to invent some hacks to stay zero-copy.

... [<<"binary before the entity">>, <<"<">>, <<"binary after the entity">>], although that is less convenient to work with.

Send in your implementations! It's very simple!

Implementations for ruby and golang have already been submitted on GitHub. Update the post!
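The "list of pieces around a decoded entity" idea from the comments above can be sketched in Go, where sub-slices of a byte slice share the original buffer. This is only an illustration, not the code of any parser in the benchmark: `splitAtEntity` and `sharesMemory` are hypothetical helpers, and only the `&lt;` entity is handled.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitAtEntity is a hypothetical helper. It looks for the first "&lt;" in src
// and returns three pieces: the text before it, the decoded "<", and the text
// after it. The first and last pieces are sub-slices of src, so no text bytes
// are copied. If there is no entity, src is returned as the single piece.
func splitAtEntity(src []byte) [][]byte {
	entity := []byte("&lt;")
	i := bytes.Index(src, entity)
	if i < 0 {
		return [][]byte{src}
	}
	return [][]byte{src[:i], []byte("<"), src[i+len(entity):]}
}

// sharesMemory reports whether sub begins at byte offset off of whole's
// backing array, i.e. whether sub is a view into whole and not a copy.
func sharesMemory(whole, sub []byte, off int) bool {
	if len(sub) == 0 || off >= len(whole) {
		return false
	}
	return &whole[off] == &sub[0]
}

func main() {
	src := []byte("a &lt; b")
	parts := splitAtEntity(src)
	for _, p := range parts {
		fmt.Printf("%q\n", p)
	}
	// The text after the entity starts at offset 6 of the original buffer.
	fmt.Println(sharesMemory(src, parts[0], 0), sharesMemory(src, parts[2], 6))
}
```

The trade-off is the one mentioned in the thread: the text stays zero-copy, but the caller now gets a list of fragments instead of one contiguous string.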
$ PLATFORMS="golang perl ruby" ./run.sh 100
===============
golang
===============
******************************
parser:./bin/bench_exp_html file:../page_google.html
0.270294 s
real:0.27 user:0.27 sys:0.00 max RSS:4540
******************************
parser:./bin/bench_gokogiri file:../page_google.html
0.315391 s
real:0.32 user:0.28 sys:0.03 max RSS:78196
******************************
parser:./bin/bench_h5 file:../page_google.html
5.436318 s
real:5.43 user:5.42 sys:0.00 max RSS:6716
******************************
parser:./bin/bench_exp_html file:../page_habrahabr-70330.html
5.826405 s
real:5.83 user:5.78 sys:0.02 max RSS:50492
******************************
parser:./bin/bench_gokogiri file:../page_habrahabr-70330.html
6.471175 s
real:6.56 user:5.89 sys:0.65 max RSS:1943176
******************************
parser:./bin/bench_h5 file:../page_habrahabr-70330.html
96.829145 s
real:96.83 user:96.51 sys:0.02 max RSS:62356
******************************
parser:./bin/bench_exp_html file:../page_habrahabr-index.html
0.293615 s
real:0.29 user:0.29 sys:0.00 max RSS:4828
******************************
parser:./bin/bench_gokogiri file:../page_habrahabr-index.html
0.361011 s
real:0.36 user:0.34 sys:0.02 max RSS:97308
******************************
parser:./bin/bench_h5 file:../page_habrahabr-index.html
4.111917 s
real:4.11 user:4.09 sys:0.00 max RSS:5472
******************************
parser:./bin/bench_exp_html file:../page_wikipedia.html
0.370002 s
real:0.37 user:0.36 sys:0.00 max RSS:5100
******************************
parser:./bin/bench_gokogiri file:../page_wikipedia.html
0.343510 s
real:0.35 user:0.31 sys:0.04 max RSS:106832
******************************
parser:./bin/bench_h5 file:../page_wikipedia.html
4.674953 s
real:4.67 user:4.64 sys:0.01 max RSS:6564
===============
perl
===============
******************************
parser:mojo_parser.pm file:../page_google.html
3.89669394493103 s
real:4.02 user:3.98 sys:0.01 max RSS:8096
******************************
parser:mojo_parser.pm file:../page_habrahabr-70330.html
75.7632319927216 s
real:75.81 user:75.56 sys:0.01 max RSS:36384
******************************
parser:mojo_parser.pm file:../page_habrahabr-index.html
3.66110587120056 s
real:3.70 user:3.68 sys:0.00 max RSS:8288
******************************
parser:mojo_parser.pm file:../page_wikipedia.html
4.00335907936096 s
real:4.04 user:4.01 sys:0.01 max RSS:8404
===============
ruby
===============
******************************
parser:nokogiri_parser.rb file:../page_google.html
0.343421056 s
real:0.39 user:0.37 sys:0.01 max RSS:13124
******************************
parser:nokogiri_parser.rb file:../page_habrahabr-70330.html
8.438090464 s
real:8.49 user:8.45 sys:0.00 max RSS:36876
******************************
parser:nokogiri_parser.rb file:../page_habrahabr-index.html
0.363908843 s
real:0.40 user:0.40 sys:0.00 max RSS:14352
******************************
parser:nokogiri_parser.rb file:../page_wikipedia.html
0.378928282 s
real:0.41 user:0.41 sys:0.00 max RSS:14776
$ PLATFORMS="c-libxml2" ./run.sh 100 # added just for comparison
===============
c-libxml2
===============
******************************
parser:libxml2_html_parser.c file:../page_google.html
0.306867 s
real:0.31 user:0.28 sys:0.00 max RSS:2244
******************************
parser:libxml2_html_parser.c file:../page_habrahabr-70330.html
6.887496 s
real:6.89 user:6.80 sys:0.06 max RSS:24292
******************************
parser:libxml2_html_parser.c file:../page_habrahabr-index.html
0.291910 s
real:0.29 user:0.28 sys:0.00 max RSS:2412
******************************
parser:libxml2_html_parser.c file:../page_wikipedia.html
0.374730 s
real:0.37 user:0.37 sys:0.00 max RSS:2504
$ PLATFORMS="dart" ./run.sh 100
===============
dart
===============
******************************
parser:html5lib_parser.dart file:../page_google.html
38864
real:39.13 user:38.76 sys:0.27 max RSS:73740
******************************
parser:html5lib_parser.dart file:../page_habrahabr-70330.html
686300
real:687.00 user:679.05 sys:5.72 max RSS:269588
******************************
parser:html5lib_parser.dart file:../page_habrahabr-index.html
34617
real:34.88 user:34.48 sys:0.26 max RSS:80804
******************************
parser:html5lib_parser.dart file:../page_wikipedia.html
36756
real:37.01 user:36.53 sys:0.33 max RSS:76560
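For anyone who wants to submit an implementation, the timing part of such a benchmark can be quite small. Below is a minimal Go sketch under two assumptions that are not confirmed by the thread: that the `100` passed to `./run.sh` is the number of parse iterations, and that the first number printed per file is the total elapsed time. `timeRuns` and `countTags` are hypothetical names, and `countTags` is only a stand-in for a real HTML parser.

```go
package main

import (
	"fmt"
	"time"
)

// timeRuns calls parse on data n times and returns the total wall-clock
// time spent in the loop.
func timeRuns(parse func([]byte) int, data []byte, n int) time.Duration {
	start := time.Now()
	for i := 0; i < n; i++ {
		parse(data)
	}
	return time.Since(start)
}

// countTags is a stand-in "parser" (hypothetical): it only counts '<' bytes.
// A real benchmark would build a DOM tree with an actual HTML parser here.
func countTags(data []byte) int {
	n := 0
	for _, b := range data {
		if b == '<' {
			n++
		}
	}
	return n
}

func main() {
	page := []byte("<html><body>Hello, World!</body></html>")
	elapsed := timeRuns(countTags, page, 100)
	// Printed in the same "<seconds> s" shape as the results above.
	fmt.Printf("%f s\n", elapsed.Seconds())
}
```

The real/user/sys and max RSS figures in the logs look like they come from an external tool wrapped around the process, so they are left out of this sketch.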
$doc = new DOMDocument();
$doc->recover = true;              // try to keep parsing malformed markup
$doc->strictErrorChecking = false; // property names are case-sensitive in PHP
$doc->loadHTML($html_string);
parser:tidy_simplexml.php file:../pages/page_10086.cn.html
PHP Warning: simplexml_load_string(): Entity: line 2: parser error : PEReference in prolog in /home/seriy/workspace/html_parser_bench/php-tidy/tidy_simplexml.php on line 43
PHP Warning: simplexml_load_string(): "%20http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> in /home/seriy/workspace/html_parser_bench/php-tidy/tidy_simplexml.php on line 43
PHP Warning: simplexml_load_string(): ^ in /home/seriy/workspace/html_parser_bench/php-tidy/tidy_simplexml.php on line 43
******************************
parser:tidy_simplexml.php file:../pages/page_163.com.html
PHP Warning: simplexml_load_string(): Entity: line 1192: parser error : Char 0xDF71 out of allowed range in /home/seriy/workspace/html_parser_bench/php-tidy/tidy_simplexml.php on line 43
PHP Warning: simplexml_load_string(): ���<U+DF71>2012���<U+05FD>����ѡ</a> in /home/seriy/workspace/html_parser_bench/php-tidy/tidy_simplexml.php on line 43
PHP Warning: simplexml_load_string(): ^ in /home/seriy/workspace/html_parser_bench/php-tidy/tidy_simplexml.php on line 43
PHP Warning: simplexml_load_string(): Entity: line 1192: parser error : PCDATA invalid Char value 57201 in /home/seriy/workspace/html_parser_bench/php-tidy/tidy_simplexml.php on line 43
******************************
parser:tidy_simplexml.php file:../pages/page_addthis.com.html
PHP Warning: simplexml_load_string(): namespace error : Namespace prefix addthis for userid on a is not defined in /home/seriy/workspace/html_parser_bench/php-tidy/tidy_simplexml.php on line 43
PHP Warning: simplexml_load_string(): addthis:userid="AddThis"></a> in /home/seriy/workspace/html_parser_bench/php-tidy/tidy_simplexml.php on line 43
PHP Warning: simplexml_load_string(): ^ in /home/seriy/workspace/html_parser_bench/php-tidy/tidy_simplexml.php on line 43
PHP Warning: simplexml_load_string(): namespace error : Namespace prefix addthis for userid on a is not defined in /home/seriy/workspace/html_parser_bench/php-tidy/tidy_simplexml.php on line 43
PHP Warning: simplexml_load_string(): addthis:userid="addthis"></a> in /home/seriy/workspace/html_parser_bench/php-tidy/tidy_simplexml.php on line 43
PHP Warning: simplexml_load_string(): ^ in /home/seriy/workspace/html_parser_bench/php-tidy/tidy_simplexml.php on line 43
PHP Warning: simplexml_load_string(): namespace error : Namespace prefix addthis for usertype on a is not defined in /home/seriy/workspace/html_parser_bench/php-tidy/tidy_simplexml.php on line 43
PHP Warning: simplexml_load_string(): addthis:usertype="company" addthis:userid="167173"></a> in /home/seriy/workspace/html_parser_bench/php-tidy/tidy_simplexml.php on line 43
PHP Warning: simplexml_load_string(): ^ in /home/seriy/workspace/html_parser_bench/php-tidy/tidy_simplexml.php on line 43
"Jquery style selector engine for HTML documents, in Go."
It uses exp/html to parse HTML, though I don't know whether that is a modified version or not.
HTML parser benchmark