Python Pdf Text Extraction 178981

Partial capture of text on file.
                          Traﬁlatura: A Web Scraping Library and Command-Line Tool
                                                for Text Discovery and Extraction
                                                               Adrien Barbaresi
                                            Center for Digital Lexicography of German (ZDL)
                                           Berlin-Brandenburg Academy of Sciences (BBAW)
                                                   Jgerstr. 22-23, 10117 Berlin, Germany
                                                           barbaresi@bbaw.de
                                       Abstract                               Asigniﬁcant challenge lies in the ability to ex-
                      Anessential operation in web corpus construc-        tract and pre-process web data to meet scientiﬁc
                      tion consists in retaining the desired content       expectations with respect to text quality. An es-
                      while discarding the rest. Another challenge         sential operation in corpus construction consists
                      ﬁnding one’s way through websites. This ar-          in retaining the desired content while discarding
                      ticle introduces a text discovery and extrac-        the rest, a task carrying various names referring to
                      tion tool published under open-source license.       speciﬁc subtasks or to pre-processing as a whole:
                      Its installation and use is straightforward, no-     webscraping, boilerplate removal, web page seg-
                      tably from Python and on the command-line.           mentation, web page cleaning, template extraction,
                      The software allows for main text, comments          or content extraction. This step is sometimes over-
                      and metadata extraction, while also providing        looked although it involves a series of design de-
                      building blocks for web crawling tasks.    A         cisions and turning points in data processing. De-
                      comparativeevaluationonreal-worlddataalso
                      showsitsinterestaswellastheperformanceof             pendingonthepurposeofdatacollection,adequate
                      other available solutions.                           ﬁltering and quality assessment can be crucial. It
                      Thecontributionsofthispaperarethreefold: it          has a signiﬁcant impact on a wide range of down-
                      references the software, features a benchmark,       stream applications like text analysis, information
                      andprovides a meaningful baseline for similar        retrieval, link analysis, page adaptation to other ter-
                      tasks. The tool performs signiﬁcantly better         minalsandscreens,andespeciallynaturallanguage
                      than other open-source solutions in this evalu-      processing pipelines.
                      ation and in external benchmarks.                       Another challenge is how to ﬁnd one’s way
                  1   Introduction                                         through the Web, notably as linguistic data are
                  1.1   Gathering texts from the Web                       gathered by running targeted web crawlers (Scan-
                                                                           nell, 2007). As web crawling involves discarding
                  As useful monolingual text corpora across lan-           muchofthedownloadedcontent(Olston and Na-
                  guages are highly relevant for the NLP community         jork, 2010), especially link ﬁltering and prioritiza-
                  (Caswell et al., 2020), web corpora seem to be a         tion can prove to be tricky for contexts in which
                  natural way to gather language data. Corpus con-         data collection is just the ﬁrst step of a project, so
                  struction usually involves “crawling, downloading,       that time resources for this task are scarce. Data
                  ‘cleaning’ and de-duplicating the data, then linguis-    collection approaches using the CommonCrawl1
                  tically annotating it and loading it into a corpus       have ﬂourished as they allow for faster download
                  query tool” (Kilgarriff, 2007). However, although        and processing by skipping (or more precisely out-
                  text is ubiquitous on the Web, drawing accurate          sourcing) the crawling phase. Barring the fact that
                  information from web pages can be difﬁcult. In ad-       ﬁnding one’s “own” way through the Web can be
                  dition, the vastly increasing variety of corpora, text   preferable, such data should not be used without
                  types and use cases makes it more and more difﬁ-         forethought and exhaustive ﬁltering. Beside the dis-
                  cult to assess the usefulness and appropriateness of     covery of relevant websites, a major issue consists
                  certain web texts for given research objectives. As      in selecting appropriate content after download and
                  a result, content adequacy, focus and quality need                        ¨
                                                                           processing(Schaferetal.,2013),whichcanbecom-
                  to be evaluated after the downloads (Baroni et al.,
                  2009).                                                      1https://commoncrawl.org
                                                                       122
                 Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th
            International Joint Conference on Natural Language Processing: System Demonstrations, pages 122–131, August 1st - August 6th, 2021.
                                                   ©2021Association for Computational Linguistics
                 plex due to unexpected machine-generated ﬂaws           from a cleaner text base. In the concrete case of
                 and biases.                                             linguistic and lexicographic research, it allows for
                    Finally, depending on the project’s jurisdiction,    content queries on meaningful parts of the docu-
                 legal aspects of retrieving and granting access         ments.
                 to web documents can be unclear or restrictive.           The remainder of this article introduces a text
                 Boundaries of copyright law are not clear when it       extraction and web navigation tool published un-
                 comes to corpus building (De Clercq and Perez,          der open-source license. Its installation and use is
                 2010) so that some corpus infrastructure projects       straightforward, notably from Python and on the
                 leave it to users to decide what to do from a copy-     command-line. The software makes it easier to ex-
                 right standpoint (Benko, 2016). Copyright and           tract the main text, comments and metadata, while
                 intellectual property rights usually do not apply       also providing building blocks for text discovery
                 to resources such as language models or n-grams         tasks such as web crawling. The following also
                 (Buck et al., 2014), so are shufﬂed sentences (Bie-     entails a comparative evaluation of text extraction
                 mannetal., 2007). Web corpora focusing on man-          on real-world data. The contributions of this paper
                 ually selected sources under Creative Commons           are thus threefold as it references the software, fea-
                 licenses have been built (Brunello, 2009; Lyding        tures a benchmark, and provides a fast, meaningful
                 et al., 2014), although only a very small propor-       baseline for similar tasks.
                                                              ¨
                 tion of websites use them (Barbaresi and Wurzner,       2   State of the art
                 2014). Corporabasedonmachine-checkedlicenses
                 havealsobeendeveloped(Habernaletal.,2016),as            2.1   “Adifﬁcult IE problem”
                 well as systems to merge annotation with web parts      Even before the “Web 2.0” paradigm with web
                                                ¨
                 from the CommonCrawl (Schafer, 2016). Consid-           pages assembling information from and for a va-
                 ering the progresses of annotation tools, is can be     riety of sources (notably the advertising industry),
                 easier to retrieve documents directly from the Web      webpageshavebeenknownfortheirlackoffocus
                 or from archives and to process them to one’s taste.    ondirectly usable text content. Despite the quantity
                 1.2   Research context                                  of pages following an article format where there is
                 This effort is part of methods to derive informa-       a main text to be found, web pages now accessible
                 tion from web documents in order to build text          through archives cannot be expected to be easy to
                 databases for a lexicographic information plat-         process: “Articles published on the WWW often
                 form (Geyken et al., 2017). Extracting and pre-         contain extraneous clutter. Most articles consist
                 processing web texts to the exacting standards of       of a main body which constitutes the relevant part
                 scientiﬁc research turned out to be a substantial       of the particular page. [...] Identifying the main
                 challenge where existing open-source solutions          bodyofawebpageinageneralrobustmanneris
                 were not entirely convincing in terms of accuracy,      a difﬁcult information extraction problem.” (Finn
                 versatility, and ease of use. The current tool fol-     et al., 2001)
                 lows from earlier work on news and blog articles          Web pages come in different shapes and sizes
                 extraction (Barbaresi, 2015, 2016). Its packaging       mostly because of the wide variety of platforms
                 into a directly re-usable format generalizes the pro-   and content management systems, and not least be-
                 cess and makes it available to the community, with      cause of varying reasons to publish and diverging
                 thorough testing it has also become much more           goals followed during web publication. Web page
                 robust and versatile.                                   structure is also constantly evolving from the per-
                                                                         spective of standards. HTML 5 was ﬁrst released in
                 1.3   Contributions                                     2008 to provide support for multimedia and graph-
                 Distinguishing between a whole page and the             ical elements. This standard streamlined syntax
                 page’s essential parts can help to alleviate many       while retaining backward-compatibility. Web con-
                 quality problems related to web text processing, no-    tent extraction is also an active ﬁeld of research in
                 tably by dealing with the noise caused by recurring     user experience, resulting from the need for higher
                 elements (headers and footers, ads, links/blogroll,     download and rendering speeds as well as from a
                 etc.). This can be particularly useful to de-duplicate  growing amount of “Web bloat” requiring the de-
                                                                                                                       2
                 recurring language samples. Tasks related to con-       velopment of “reader modes” and “distillers” for
                 tent extraction and language modeling also beneﬁt          2https://chromium.googlesource.com/chromium/dom-
                                                                     123
                  webbrowsers(Ghasemisharif et al., 2019).                   density have proven to be good indicators in order
                                                                             to select or discard content nodes, using the cu-
                  2.2    Wrappers                                            mulative distribution of tags (Finn et al., 2001), or
                  Data extraction has ﬁrst been based on “wrappers”          with approaches such as the content extraction via
                  (now called “scrapers”) which were mostly rely-            tag ratios (Weninger et al., 2010) and the content
                  ing on manual design and tended to be brittle and          extraction via text density algorithms (Sun et al.,
                  hard to maintain (Crescenzi et al., 2001). These           2011). Statistical selection of informative nodes
                  extraction procedures have also been used early            through a combination of both methods proved
                  on by blogs search engines (Glance et al., 2004).          moreefﬁcientoncomparabledatasets(Qureshiand
                  Since the genre of “web diaries” was established           Memon,2012). The large majority of DOM-based
                  before the blogs in Japan, there have been attempts        approaches try to leverage semantic information
                  to target not only blog software but also regular          conveyedby HTMLtags,notablyparagraphs(p)on
                  pages (Nanno et al., 2004), in which the extraction        whichtext-to-tag ratios are calculated (Carey and
                  of metadata also allows for a distinction based on         Manic, 2016), or tag ratios and semantic features
                  heuristics. Regarding metadata extraction for pages        from id and class attributes (Peters and Lecocq,
                  in article form and blogs in particular, common            2013).
                  targets include the title of the entry, the date, the         Machine learning approaches have also been
                  author, the content, the number of comments, the           used, whose interest generally consists in lever-
                  archived link, and the trackback link (Glance et al.,      aging advances in classiﬁcation tasks by treating a
                  2004); they can also aim at comments speciﬁcally           HTMLdocumentasaseriesofblockstobeclassi-
                  (Mishne and Glance, 2006).                                 ﬁed. Relevant algorithms include conditional ran-
                                                                             domﬁeldslearning header, text, and noisy blocks
                  2.3    Generic web content extraction                      with markup-based, content-based, and document-
                  GenericextractiontechniquesgroundonDocument                related features (Spousta et al., 2008), support vec-
                  Object Model (DOM) examination. An earlier,                tor machines trained on linguistic, structural and
                  language-independent approach uses entropy mea-            visual features (Bauer et al., 2007), Naive Bayes
                  sures applied to features, links, and content in order     (Pasternack and Roth, 2009), multi-layer percep-
                                                                                                                               ¨
                  to discriminate among parts of a web page (Kao             tron based on paragraph-level features (Schafer
                  et al., 2004). Another notable technique, Visual           and Bildhauer, 2012), or logistic regressions (Pe-
                  Page Segmentation, applies heuristics to ﬁnd vi-           ters and Lecocq, 2013). More recently, deep learn-
                  sually grouped blocks (Cai et al., 2003). Other            ing has also been used for similar classiﬁcations,
                  methods are based on style tree induction, that is         e.g. the Web2Text system is based on convolutional
                  detection of similarities of DOM trees on site-level       neural networks learning combinations of DOM-
                  (Yi et al., 2003; Vieira et al., 2006). Overall, efforts   based features (Vogels et al., 2018).
                  madetoautomaticallygeneratewrappershavebeen                   Despite the number of article on this topic, very
                  centered on three different approaches (Guo et al.,        few systems are open-source or freely available
                  2010): wrapper induction (e.g. building a grammar          (Alarte et al., 2019).
                  to parse a web page), sequence labeling (e.g. la-
                  beled examples or a schema of data in the page),           2.4   Corpuslinguistics and NLP
                  and statistical analysis. This approach combined to        There are few comparable projects coming from
                  the inspection of DOM tree characteristics (Wang           the linguistics or natural language processing com-
                  et al., 2009; Guo et al., 2010) is a common ground         munities and focused on making software publicly
                  to the information retrieval and computational lin-        available and usable. Boilerpipe uses shallow text
                  guistics communities, with the categorization of           features like word counts and link density with
                  HTMLelements and linguistic features (Ziegler                                                               ¨
                  and Skubacz, 2007) for the former and boilerplate          decision tree and SVM classiﬁers (Kohlschutter
                  removal for the latter.                                    et al., 2010). JusText is based on length heuristics
                                                                                                                               ´
                     TheDOMconsidersagiven HTMLdocumentas                    as well as link and stop word densities (Pomikalek,
                  a tree structure whose nodes represent parts of the        2011). Both algorithms have been prevalent since
                  documenttobeoperatedon. Text, tag and/or link              their release and are now mostly used through their
                                                                             subsequent forks, as software needs to be kept up-
                  distiller                                                  to-date. More recent initiatives explicitly targeting
                                                                         124
                                                                  3                                                          5
                  corpus creation feature the Corpus Crawler or           retrieved from sources such as the CommonCrawl
                        4      ¨                                                                  6
                  Texrex (Schafer, 2017), neither of which appears        or the Internet Archive .
                  to be actively maintained.                                In addition, download utilities are included, no-
                    Anevaluation and discussion following from the        tably using a multi-threaded but “polite” processing
                  Cleaneval initiative (Baroni et al., 2008) would put    of URLqueues, i.e. time restrictions based on do-
                  the topic back into focus, as content processing on     mainnames. Persistent connections are managed
                  the Web is affected by both time and geography.         by a connection pool, thus maintaining connec-
                  This benchmark could be elaborated on, results are      tions with websites to be scraped. The tool also
                  not consistent in different languages and metrics       entails web crawling capacities which provide ac-
                  sometime fail to capture the variable inﬂuence of       cessible and fail-safe ways to gather data based on
                  extractors on downstream modules (Lejeune and           a series of target sites. First, support for sitemaps
                  Zhu, 2018). Often, tools are developed with partic-     (XMLandTXTformats)accordingtothesitemap
                  ular page styles in mind, mostly from the English-      protocol. Second, support for web feeds (ATOM,
                  speaking world (Barbaresi and Lejeune, 2020). For       RDFandRSSformats)whichmakeitpossibleto
                  certain projects, customized scrapers which are ad-     build a seamless news crawler. Third, crawling
                  justed to each website remain feasible (Krasselt        components to discover content. It can also manip-
                  et al., 2020). A generic approach can really save       ulate URL lists, including ﬁltering and prioritiza-
                  humantimeandresources, albeit at a certain cost         tion based on site characteristics or language-aware
                  in terms of accuracy depending on the context.          heuristics based on internationalization.
                                                                            The package provides a relatively light-weight
                  3   Introducing the Traﬁlatura tool                     and modular architecture, letting users choose the
                                                                          componentstheywishtoinclude. Ithasbeentested
                  3.1   Features                                          onLinux, MacOSandWindows,andcanbeused
                  Traﬁlatura is a web scraping tool for text discovery    with Python, on the command-line, with R (us-
                  and retrieval which seamlessly downloads, parses,       ing the reticulate adapter package), and through a
                  and scrapes web page data. It can crawl and dis-        graphical user interface. The package documenta-
                                                                                                                             7
                  cover texts within a website and process them ac-       tion also acts as a manual on web text collection.
                  cordingly. The extractor focuses on metadata, main      3.2   Extraction process
                  body text and comments while preserving parts
                  of the text formatting and page structure. It aims      The extraction combines two acknowledged li-
                                                                                                   8             9
                  to be precise enough in order not to miss texts or      braries, readability-lxml and jusText , which are
                  to discard valid documents, as it must be robust        used as safety nets and fallbacks. Traﬁlatura’s own
                  but also reasonably fast. With these objectives in      extraction algorithm is based on a cascade of rule-
                  mind, Traﬁlatura is designed to run in production       based ﬁlters and content heuristics:
                  on millions of web documents.                           (1) Content delimitation is performed by XPath ex-
                    Thesoftware features parallel online and ofﬂine       pressions targeting common HTML elements and
                  processing: URLs, HTML ﬁles or parsed HTML              attributes as well as idiosyncrasies of main content
                  trees can be used as input. Although straight out-      management systems, ﬁrst in a negative perspec-
                  put of Python variables is possible, conversion to      tive with the exclusion of unwanted parts of the
                  various common output formats makes the soft-           HTMLcode(e.g. ) and next
                  ware more versatile: plain text (minimal format-        bycenteringonthedesirablecontent(e.g. ). The same operations are
                  JSON(with metadata), XML and XML-TEI (for               performed for comments in case they are part of
                  metadata and structure). The latter support for TEI     the extraction. The selected nodes of the HTML
                  format (following the recommendations of the Text       tree are then processed, i.e. checked for relevance
                  Encoding Initiative) also includes a validator for      (notably by element type, text length and link den-
                  Pythonwhichcanbeusedapartfromtheextraction.             sity) and simpliﬁed as to their HTML structure.
                  Thescraping and conversion parts also work with            5https://commoncrawl.org/
                  existing archives, Raw HTML documents can be               6https://archive.org/
                                                                             7https://traﬁlatura.readthedocs.io/
                     3https://github.com/google/corpuscrawler                8https://github.com/buriy/python-readability
                     4https://github.com/rsling/texrex                       9https://github.com/miso-belica/jusText
                                                                      125
The words contained in this file might help you see if this file matches what you are looking for:

...Tralatura a web scraping library and command line tool for text discovery extraction adrien barbaresi center digital lexicography of german zdl berlin brandenburg academy sciences bbaw jgerstr germany de abstract asignicant challenge lies in the ability to ex anessential operation corpus construc tract pre process data meet scientic tion consists retaining desired content expectations with respect quality an es while discarding rest another sential construction nding one s way through websites this ar ticle introduces extrac task carrying various names referring published under open source license specic subtasks or processing as whole its installation use is straightforward no webscraping boilerplate removal page seg tably from python on mentation cleaning template software allows main comments step sometimes over metadata also providing looked although it involves series design building blocks crawling tasks cisions turning points comparativeevaluationonreal worlddataalso showsitsint...
Related files

Share

Help

Related files

Share

Share to social media

Help

Login Area