Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction

Adrien Barbaresi
Center for Digital Lexicography of German (ZDL)
Berlin-Brandenburg Academy of Sciences (BBAW)
Jägerstr. 22-23, 10117 Berlin, Germany
barbaresi@bbaw.de

Abstract

An essential operation in web corpus construction consists in retaining the desired content while discarding the rest. Another challenge lies in finding one's way through websites. This article introduces a text discovery and extraction tool published under open-source license. Its installation and use is straightforward, notably from Python and on the command-line. The software allows for main text, comments and metadata extraction, while also providing building blocks for web crawling tasks. A comparative evaluation on real-world data also shows its interest as well as the performance of other available solutions.
The contributions of this paper are threefold: it references the software, features a benchmark, and provides a meaningful baseline for similar tasks. The tool performs significantly better than other open-source solutions in this evaluation and in external benchmarks.

1 Introduction

1.1 Gathering texts from the Web

As useful monolingual text corpora across languages are highly relevant for the NLP community (Caswell et al., 2020), web corpora seem to be a natural way to gather language data. Corpus construction usually involves "crawling, downloading, 'cleaning' and de-duplicating the data, then linguistically annotating it and loading it into a corpus query tool" (Kilgarriff, 2007). However, although text is ubiquitous on the Web, drawing accurate information from web pages can be difficult. In addition, the vastly increasing variety of corpora, text types and use cases makes it more and more difficult to assess the usefulness and appropriateness of certain web texts for given research objectives. As a result, content adequacy, focus and quality need to be evaluated after the downloads (Baroni et al., 2009).

A significant challenge lies in the ability to extract and pre-process web data to meet scientific expectations with respect to text quality. An essential operation in corpus construction consists in retaining the desired content while discarding the rest, a task carrying various names referring to specific subtasks or to pre-processing as a whole: web scraping, boilerplate removal, web page segmentation, web page cleaning, template extraction, or content extraction. This step is sometimes overlooked although it involves a series of design decisions and turning points in data processing. Depending on the purpose of data collection, adequate filtering and quality assessment can be crucial, and it has a significant impact on a wide range of downstream applications like text analysis, information retrieval, link analysis, page adaptation to other terminals and screens, and especially natural language processing pipelines.

Another challenge is how to find one's way through the Web, notably as linguistic data are gathered by running targeted web crawlers (Scannell, 2007). As web crawling involves discarding much of the downloaded content (Olston and Najork, 2010), link filtering and prioritization in particular can prove tricky in contexts where data collection is only the first step of a project, so that time resources for this task are scarce. Data collection approaches using the CommonCrawl (https://commoncrawl.org) have flourished as they allow for faster download and processing by skipping (or more precisely outsourcing) the crawling phase. Barring the fact that finding one's "own" way through the Web can be preferable, such data should not be used without forethought and exhaustive filtering. Beside the discovery of relevant websites, a major issue consists in selecting appropriate content after download and processing (Schäfer et al., 2013), which can be complex due to unexpected machine-generated flaws and biases.

Finally, depending on the project's jurisdiction, legal aspects of retrieving and granting access to web documents can be unclear or restrictive. Boundaries of copyright law are not clear when it comes to corpus building (De Clercq and Perez, 2010), so that some corpus infrastructure projects leave it to users to decide what to do from a copyright standpoint (Benko, 2016). Copyright and intellectual property rights usually do not apply to resources such as language models or n-grams (Buck et al., 2014), which also holds for shuffled sentences (Biemann et al., 2007). Web corpora focusing on manually selected sources under Creative Commons licenses have been built (Brunello, 2009; Lyding et al., 2014), although only a very small proportion of websites use such licenses (Barbaresi and Würzner, 2014). Corpora based on machine-checked licenses have also been developed (Habernal et al., 2016), as well as systems to merge annotation with web parts from the CommonCrawl (Schäfer, 2016). Considering the progress of annotation tools, it can be easier to retrieve documents directly from the Web or from archives and to process them to one's taste.

1.2 Research context

This effort is part of methods to derive information from web documents in order to build text databases for a lexicographic information platform (Geyken et al., 2017). Extracting and pre-processing web texts to the exacting standards of scientific research turned out to be a substantial challenge, as existing open-source solutions were not entirely convincing in terms of accuracy, versatility, and ease of use. The current tool follows from earlier work on news and blog article extraction (Barbaresi, 2015, 2016). Its packaging into a directly re-usable format generalizes the process and makes it available to the community; with thorough testing it has also become much more robust and versatile.

1.3 Contributions

Distinguishing between a whole page and the page's essential parts can help to alleviate many quality problems related to web text processing, notably by dealing with the noise caused by recurring elements (headers and footers, ads, links/blogroll, etc.). This can be particularly useful to de-duplicate recurring language samples. Tasks related to content extraction and language modeling also benefit from a cleaner text base. In the concrete case of linguistic and lexicographic research, it allows for content queries on meaningful parts of the documents.

The remainder of this article introduces a text extraction and web navigation tool published under open-source license. Its installation and use is straightforward, notably from Python and on the command-line. The software makes it easier to extract the main text, comments and metadata, while also providing building blocks for text discovery tasks such as web crawling. The following also entails a comparative evaluation of text extraction on real-world data. The contributions of this paper are thus threefold as it references the software, features a benchmark, and provides a fast, meaningful baseline for similar tasks.
2 State of the art

2.1 "A difficult IE problem"

Even before the "Web 2.0" paradigm, with web pages assembling information from and for a variety of sources (notably the advertising industry), web pages have been known for their lack of focus on directly usable text content. Despite the quantity of pages following an article format where there is a main text to be found, web pages now accessible through archives cannot be expected to be easy to process: "Articles published on the WWW often contain extraneous clutter. Most articles consist of a main body which constitutes the relevant part of the particular page. [...] Identifying the main body of a web page in a general robust manner is a difficult information extraction problem." (Finn et al., 2001)

Web pages come in different shapes and sizes mostly because of the wide variety of platforms and content management systems, and not least because of varying reasons to publish and diverging goals followed during web publication. Web page structure is also constantly evolving from the perspective of standards. HTML 5 was first released in 2008 to provide support for multimedia and graphical elements; this standard streamlined syntax while retaining backward-compatibility. Web content extraction is also an active field of research in user experience, resulting from the need for higher download and rendering speeds as well as from a growing amount of "Web bloat", which requires the development of "reader modes" and "distillers" (https://chromium.googlesource.com/chromium/dom-distiller) for web browsers (Ghasemisharif et al., 2019).

2.2 Wrappers

Data extraction was first based on "wrappers" (now called "scrapers"), which mostly relied on manual design and tended to be brittle and hard to maintain (Crescenzi et al., 2001). These extraction procedures were also used early on by blog search engines (Glance et al., 2004). Since the genre of "web diaries" was established before blogs in Japan, there have been attempts to target not only blog software but also regular pages (Nanno et al., 2004), in which the extraction of metadata also allows for a distinction based on heuristics. Regarding metadata extraction for pages in article form and blogs in particular, common targets include the title of the entry, the date, the author, the content, the number of comments, the archived link, and the trackback link (Glance et al., 2004); extractors can also aim at comments specifically (Mishne and Glance, 2006).

2.3 Generic web content extraction

Generic extraction techniques are grounded in Document Object Model (DOM) examination. An earlier, language-independent approach uses entropy measures applied to features, links, and content in order to discriminate among parts of a web page (Kao et al., 2004). Another notable technique, Visual Page Segmentation, applies heuristics to find visually grouped blocks (Cai et al., 2003). Other methods are based on style tree induction, that is, the detection of similarities between DOM trees at site level (Yi et al., 2003; Vieira et al., 2006). Overall, efforts to automatically generate wrappers have centered on three different approaches (Guo et al., 2010): wrapper induction (e.g. building a grammar to parse a web page), sequence labeling (e.g. labeled examples or a schema of data in the page), and statistical analysis. This approach, combined with the inspection of DOM tree characteristics (Wang et al., 2009; Guo et al., 2010), is common ground to the information retrieval and computational linguistics communities, with the categorization of HTML elements and linguistic features (Ziegler and Skubacz, 2007) for the former and boilerplate removal for the latter.

The DOM considers a given HTML document as a tree structure whose nodes represent parts of the document to be operated on. Text, tag and/or link density have proven to be good indicators for selecting or discarding content nodes, using the cumulative distribution of tags (Finn et al., 2001), or with approaches such as content extraction via tag ratios (Weninger et al., 2010) and content extraction via text density (Sun et al., 2011). Statistical selection of informative nodes through a combination of both methods proved more efficient on comparable datasets (Qureshi and Memon, 2012). The large majority of DOM-based approaches try to leverage semantic information conveyed by HTML tags, notably paragraphs (p) on which text-to-tag ratios are calculated (Carey and Manic, 2016), or tag ratios and semantic features from id and class attributes (Peters and Lecocq, 2013).

Machine learning approaches have also been used, whose interest generally consists in leveraging advances in classification tasks by treating an HTML document as a series of blocks to be classified. Relevant algorithms include conditional random fields learning header, text, and noisy blocks with markup-based, content-based, and document-related features (Spousta et al., 2008), support vector machines trained on linguistic, structural and visual features (Bauer et al., 2007), Naive Bayes (Pasternack and Roth, 2009), multi-layer perceptrons based on paragraph-level features (Schäfer and Bildhauer, 2012), or logistic regressions (Peters and Lecocq, 2013). More recently, deep learning has also been used for similar classifications, e.g. the Web2Text system is based on convolutional neural networks learning combinations of DOM-based features (Vogels et al., 2018).

Despite the number of articles on this topic, very few systems are open-source or freely available (Alarte et al., 2019).
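To give a rough intuition of the length and link-density heuristics surveyed above, the following sketch shows one simplified way to flag boilerplate nodes; the thresholds and the decision rule are illustrative assumptions chosen for the example, not taken from any of the cited systems.

# Simplified illustration of link-density and length heuristics;
# thresholds are arbitrary example values, not from any cited system.
from lxml import html

def link_density(node):
    """Share of characters located inside <a> elements."""
    text = node.text_content()
    link_chars = sum(len(a.text_content()) for a in node.findall(".//a"))
    return link_chars / max(len(text), 1)

def is_boilerplate(node, max_density=0.5, min_length=25):
    """Flag short or link-heavy nodes as likely boilerplate."""
    text = node.text_content().strip()
    return len(text) < min_length or link_density(node) > max_density

tree = html.fromstring(
    "<div><p>A paragraph long enough to count as running text.</p>"
    "<p><a href='/'>Home</a> <a href='/tags'>Tags</a></p></div>")
for paragraph in tree.findall(".//p"):
    print(is_boilerplate(paragraph), paragraph.text_content()[:30])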
In addition, download utilities are included, no- Anevaluation and discussion following from the tably using a multi-threaded but “polite” processing Cleaneval initiative (Baroni et al., 2008) would put of URLqueues, i.e. time restrictions based on do- the topic back into focus, as content processing on mainnames. Persistent connections are managed the Web is affected by both time and geography. by a connection pool, thus maintaining connec- This benchmark could be elaborated on, results are tions with websites to be scraped. The tool also not consistent in different languages and metrics entails web crawling capacities which provide ac- sometime fail to capture the variable influence of cessible and fail-safe ways to gather data based on extractors on downstream modules (Lejeune and a series of target sites. First, support for sitemaps Zhu, 2018). Often, tools are developed with partic- (XMLandTXTformats)accordingtothesitemap ular page styles in mind, mostly from the English- protocol. Second, support for web feeds (ATOM, speaking world (Barbaresi and Lejeune, 2020). For RDFandRSSformats)whichmakeitpossibleto certain projects, customized scrapers which are ad- build a seamless news crawler. Third, crawling justed to each website remain feasible (Krasselt components to discover content. It can also manip- et al., 2020). A generic approach can really save ulate URL lists, including filtering and prioritiza- humantimeandresources, albeit at a certain cost tion based on site characteristics or language-aware in terms of accuracy depending on the context. heuristics based on internationalization. The package provides a relatively light-weight 3 Introducing the Trafilatura tool and modular architecture, letting users choose the componentstheywishtoinclude. Ithasbeentested 3.1 Features onLinux, MacOSandWindows,andcanbeused Trafilatura is a web scraping tool for text discovery with Python, on the command-line, with R (us- and retrieval which seamlessly downloads, parses, ing the reticulate adapter package), and through a and scrapes web page data. It can crawl and dis- graphical user interface. The package documenta- 7 cover texts within a website and process them ac- tion also acts as a manual on web text collection. cordingly. The extractor focuses on metadata, main 3.2 Extraction process body text and comments while preserving parts of the text formatting and page structure. It aims The extraction combines two acknowledged li- 8 9 to be precise enough in order not to miss texts or braries, readability-lxml and jusText , which are to discard valid documents, as it must be robust used as safety nets and fallbacks. Trafilatura’s own but also reasonably fast. With these objectives in extraction algorithm is based on a cascade of rule- mind, Trafilatura is designed to run in production based filters and content heuristics: on millions of web documents. (1) Content delimitation is performed by XPath ex- Thesoftware features parallel online and offline pressions targeting common HTML elements and processing: URLs, HTML files or parsed HTML attributes as well as idiosyncrasies of main content trees can be used as input. Although straight out- management systems, first in a negative perspec- put of Python variables is possible, conversion to tive with the exclusion of unwanted parts of the various common output formats makes the soft- HTMLcode(e.g.) and next ware more versatile: plain text (minimal format- bycenteringonthedesirablecontent(e.g.). 
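To make the features above concrete, here is a minimal usage sketch; it is not taken from the paper itself, the URLs are placeholders, and function or parameter names may differ slightly between releases of the package.

# Minimal usage sketch; URLs are placeholders and parameter names
# may vary slightly between Trafilatura releases.
import trafilatura
from trafilatura.sitemaps import sitemap_search
from trafilatura.feeds import find_feed_urls

# Download a page and extract its main text, comments and structure.
downloaded = trafilatura.fetch_url("https://www.example.org/article")
if downloaded is not None:
    plain_text = trafilatura.extract(downloaded)
    xml_output = trafilatura.extract(downloaded, output_format="xml",
                                     include_comments=True)

# Text discovery: gather candidate URLs from sitemaps or web feeds.
sitemap_urls = sitemap_search("https://www.example.org")
feed_urls = find_feed_urls("https://www.example.org")

The command-line interface exposes comparable operations for single URLs, URL lists, sitemaps and feeds.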
3.2 Extraction process

The extraction combines two acknowledged libraries, readability-lxml (https://github.com/buriy/python-readability) and jusText (https://github.com/miso-belica/jusText), which are used as safety nets and fallbacks. Trafilatura's own extraction algorithm is based on a cascade of rule-based filters and content heuristics:

(1) Content delimitation is performed by XPath expressions targeting common HTML elements and attributes as well as idiosyncrasies of main content management systems, first in a negative perspective with the exclusion of unwanted parts of the HTML code and next by centering on the desirable content. The same operations are performed for comments in case they are part of the extraction. The selected nodes of the HTML tree are then processed, i.e. checked for relevance (notably by element type, text length and link density) and simplified as to their HTML structure.
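As an illustration of this first step of the cascade, the sketch below prunes unwanted subtrees with negative XPath rules before selecting a candidate content node with positive ones; the expressions and the fallback are simplified assumptions and do not reproduce Trafilatura's actual rule set.

# Illustrative sketch of XPath-based content delimitation; the expressions
# below are simplified assumptions, not Trafilatura's actual rules.
from lxml import html

UNWANTED = ['//nav', '//footer',
            '//*[contains(@class, "comment-form")]',
            '//*[contains(@id, "sidebar")]']          # negative perspective
CANDIDATES = ['//article', '//main',
              '//*[contains(@class, "post-content")]']  # positive perspective

def delimit_content(html_string):
    """Return the first node matching a positive rule, after pruning."""
    tree = html.fromstring(html_string)
    # 1) exclusion of unwanted parts of the HTML code
    for expression in UNWANTED:
        for node in tree.xpath(expression):
            node.getparent().remove(node)
    # 2) centering on the desirable content
    for expression in CANDIDATES:
        nodes = tree.xpath(expression)
        if nodes:
            return nodes[0]
    return tree.find('body')  # fallback: keep the whole body

doc = ("<html><body><nav>menu</nav>"
       "<article><p>Main text of the page.</p></article></body></html>")
print(delimit_content(doc).text_content().strip())  # -> Main text of the page.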