Knowledge-Based Systems 17 (2004) 219–227
www.elsevier.com/locate/knosys

A multilingual text mining approach to web cross-lingual text retrieval

Rowena Chau*, Chung-Hsing Yeh

School of Business Systems, Faculty of Information Technology, Monash University, Clayton, Vic. 3800, Australia

Received 26 August 2003; accepted 6 April 2004
Available online 28 May 2004

* Corresponding author. E-mail address: rowena.chau@infotech.monash.edu.au (R. Chau).

0950-7051/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.knosys.2004.04.001

Abstract

To enable concept-based cross-lingual text retrieval (CLTR) using multilingual text mining, our approach first discovers the multilingual concept–term relationships from linguistically diverse textual data relevant to a domain. Second, the multilingual concept–term relationships are, in turn, used to discover the conceptual content of the multilingual text, which is either a document containing potentially relevant information or a query expressing an information need. When the language-independent concepts hidden beneath both document and query are revealed, concept-based matching is made possible. Hence, concept-based CLTR is facilitated. This approach is employed for developing a multi-agent system to facilitate concept-based CLTR on the Web.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Multilingual text mining; Cross-lingual text retrieval; Agent; Fuzzy clustering; Fuzzy classification

1. Introduction

The exponential growth of the World Wide Web over the globe is the most influential factor contributing to the increasing awareness of cross-lingual text retrieval (CLTR) in recent years. Relevant information exists in different languages. A user may want to find documents in languages other than the one the query is formulated in. Among the various CLTR techniques developed recently, query translation is the most extensively studied. Such CLTR approaches are developed mainly to facilitate term-based lexical transfer between a single pair of source and target languages. However, a bilingual lexical transfer is not sufficient for fully supporting the user's need for multilingual information seeking.

Within a multilingual information community, users often rely on CLTR to explore global knowledge relevant to a certain topic or area. Instead of looking for specific documents that can be characterized by a few translation equivalents of the query terms, users are often interested in a broader view of a particular domain. They are thinking in terms of concepts and expecting to receive all relevant documents existing in any language. In such cases, concept-based CLTR capable of identifying multilingual documents about the concept of a query is necessary.

Documents and queries about the same concept do not necessarily contain matching sets of translation equivalents of each other. Conceptual relevance between documents and queries cannot be determined in an explicit way. To realize concept-based CLTR, the development of a conceptual interlingua to support lexical transfer across multiple languages is required.

To encode a conceptual interlingua, terms from multiple languages describing the same concept should be mapped to a language-independent scheme. In this way, it is possible to match a term to its corresponding counterparts in all other languages and to achieve concept-based CLTR.

A multilingual thesaurus (e.g. EuroWordNet) encoding the conceptual relationships among multilingual terms is such a conceptual interlingua and has been used to achieve this goal [7]. However, the manual construction of multilingual thesauri is very labor intensive and their coverage is not domain specific. An automatic and possibly unsupervised approach for generating such linguistic knowledge for CLTR, by discovering structures of lexical relationships among multilingual terms from analyzing text of the relevant domain, is highly desirable.

To provide better support to CLTR, a knowledge discovery technology known as text mining looks promising for discovering this kind of in-depth multilingual linguistic knowledge. Typically, text mining concerns the discovery and extraction of hidden relationships, such as conceptual associations, among textual items, including terms and documents.

To enable concept-based CLTR using multilingual text mining, our approach first discovers the multilingual concept–term relationships from linguistically diverse textual data relevant to a domain. Second, the multilingual concept–term relationships are, in turn, used to discover the conceptual content of the multilingual text, which can be either a document containing potentially relevant information or a query expressing an information need. When the language-independent concepts hidden beneath both documents and queries are revealed, concept-based matching is made possible, thus facilitating concept-based CLTR. This approach is employed for developing a multi-agent system to facilitate concept-based CLTR on the Web.

2. Current CLTR techniques

Given a query expressed in one language, the objective of CLTR is to search for relevant documents in other languages. To break the language barrier, either document or query translation is required. As query translation is less resource demanding than document translation, it has proven to be a more feasible approach to CLTR. There are three major approaches to query translation: (a) machine translation, (b) knowledge-based methods using machine-readable dictionaries [2,8], and (c) corpus-based methods using a parallel corpus [14].

Although translating a query using machine translation is straightforward, it is argued that machine translation and CLTR have divergent concerns [13]. Machine translation aims at syntactically accurate translation, which is redundant to CLTR. Since a query is short, grammatically invalid and formulated with just a few terms, it offers little context for the machine translation system to translate accurately. Besides, machine translation always replaces the original query term with only one of its many possible synonymous translations in the target language. This prevents query expansion, by which all synonymous terms are considered to enhance recall.

A query can easily be translated by replacing every query term with the set of all its possible translations as encoded in a machine-readable dictionary. However, this approach is ineffective, mainly due to the translation ambiguity of polysemous terms (i.e. terms with multiple meanings). A polysemous term may have several alternative translations carrying different senses (meanings) in any foreign language. Translating a query by including every possible translation of every query term can greatly increase the set of possible meanings in the translated query, thus contributing to poor precision. Moreover, inadequate coverage of specific terminology and phrases is also a serious shortcoming of such machine-readable dictionaries.

An alternative to a machine-readable dictionary is a parallel corpus. A parallel corpus is a set of identical texts written in multiple languages. Corpus-based query translation is based on the idea that terms are represented as points in a multi-dimensional semantic space, and terms (in different languages) mapped to the same set of points in that semantic space are used to describe the same concept. Geometric relationships between terms within the semantic space are automatically extracted by analyzing co-occurrence statistics of terms across a parallel corpus. By substituting every query term with its geometrically close translations in the semantic space, query translation is then facilitated [6,12]. The corpus-based approach is most effective for CLTR when the document collection is domain-specific. In this paper, a corpus-based approach to CLTR that applies multilingual text mining using a parallel corpus is proposed.

3. A multilingual text mining approach to cross-lingual text retrieval

Our work on enabling CLTR with multilingual text mining is focused on exploiting the knowledge discovery capability of text mining over multilingual text. This is a logical approach due to the complementary nature of these two areas. Both CLTR and multilingual text mining analyze multilingual textual data, employing techniques from information retrieval, natural language processing and machine learning. In terms of the functions they perform, CLTR facilitates multilingual information access while multilingual text mining enables knowledge discovery from multilingual texts. The objective of CLTR is to locate relevant documents from a multilingual document collection in response to a query represented by a set of terms, while the objective of multilingual text mining is to reveal concepts and their relationships embedded within a collection of multilingual texts. To determine the conceptual relevance between documents and a query written in different languages, CLTR requires understanding of their semantics. Multilingual text mining has the potential to complement CLTR by discovering the intrinsic meanings of multilingual texts. Our approach to concept-based CLTR with multilingual text mining is depicted in Fig. 1.

Fig. 1. A multilingual text mining approach to concept-based CLTR.

Within an integrated framework, multilingual text mining yields knowledge that supports CLTR. First, the multilingual concept–term relationships, which are necessary for a CLTR system to associate documents and queries across languages, are mined from a parallel corpus. This is achieved by a fuzzy multilingual term clustering algorithm. By grouping conceptually related multilingual terms into clusters, the multilingual concept–term relationships are revealed. Second, using the conceptual relationships among multilingual terms discovered in the previous step as the linguistic knowledge base, the conceptual content exhibiting ideas hidden beneath the multilingual texts is also mined. This is facilitated by a fuzzy multilingual text categorization algorithm. As a result, both documents and queries in different languages can then be encoded with language-independent concepts, instead of language-specific terms. As such, concept-based matching is made possible and concept-based CLTR is facilitated.

3.1. Mining the conceptual relationship of multilingual terms

Successful application of text mining in supporting monolingual information retrieval has been well reported [1]. To facilitate CLTR, our first multilingual text mining task is to discover the conceptual relationships among multilingual terms. Towards this end, a fuzzy multilingual term clustering algorithm is developed using a fuzzy clustering technique known as fuzzy c-means [3]. Its purpose is to generate a partition of a set of multilingual terms for revealing their concept–term relationships with additional concept membership degrees. Application of the multilingual term clustering algorithm thus results in a collection of concepts represented by clusters of conceptually related multilingual terms. This collection of clusters, analogous to a multilingual thesaurus, represents a compression and reflection of the usage of multiple languages. Its importance in concept-based CLTR is in providing a concept-oriented frame of lexical reference. A cluster of conceptually related multilingual terms helps enormously in focusing solely on relevant lexical alternatives by establishing a virtual semantic domain.

Clustering is an unsupervised method for automatic class formation. It offers the advantage that a priori knowledge of classes is not required. Typically, clustering algorithms (e.g. k-means) [9] aim to maximize inter-cluster distances and minimize intra-cluster distances under some similarity measure. In the context of mining conceptual relationships among multilingual terms, clustering looks at building up clusters of semantically related multilingual terms.

As concepts tend to overlap in terms of meaning, crisp clustering algorithms like k-means, which generate partitions such that each term is assigned to exactly one cluster, are inadequate for representing the real textual data structure. In this respect, fuzzy clustering methods that allow objects (terms) to be classified to more than one cluster with different membership values are more appropriate. With the application of fuzzy c-means, the resulting fuzzy multilingual term clusters, which are overlapping, will provide a more realistic representation of the multilingual semantic space.

The fuzzy c-means algorithm aims at minimizing the objective function

$$J(X, U, v) = \sum_{i=1}^{c} \sum_{k=1}^{n} (\mu_{ik})^{m} \, d^{2}(v_i, x_k)$$

under the constraints $\sum_{k=1}^{n} \mu_{ik} > 0$ for all $i \in \{1, \ldots, c\}$ and $\sum_{i=1}^{c} \mu_{ik} = 1$ for all $k \in \{1, \ldots, n\}$, where $X = \{x_1, \ldots, x_n\} \subseteq \mathbb{R}^p$ is the set of objects; $c$ the number of fuzzy clusters; $\mu_{ik} \in [0, 1]$ the membership degree of object $x_k$ to cluster $i$; $v_i$ the prototype (cluster center) of cluster $i$; and $d(v_i, x_k)$ the Euclidean distance between prototype $v_i$ and object $x_k$. The parameter $m > 1$ is the fuzziness index. For $m \to 1$, the clusters tend to be crisp, i.e. either $\mu_{ik} \to 1$ or $\mu_{ik} \to 0$; for $m \to \infty$, $\mu_{ik} \to 1/c$.

On the basis of the objective function optimization, fuzzy c-means is most suitable for finding optimal groupings of objects that best represent the structure of the data set. By minimizing the sum of within-group variance, the strength of associations of objects is maximized within clusters and minimized between clusters. In this respect, fuzzy c-means is particularly useful in text mining applications, such as term clustering, where the intrinsic conceptual structure and semantic relationships among terms must be revealed in order to gain new knowledge for better text categorization and retrieval.

Statistical analysis of parallel corpora has been proven to be an effective means of extracting useful multilingual lexical knowledge for CLTR, and this has been successfully applied to the development of translation models for CLTR [12]. Text in parallel translation is increasingly available as a result of the global explosion of the World Wide Web. Toward using the World Wide Web as a source of parallel text, effective techniques for automatically identifying parallel translated documents on the Web have also been developed [4,15].

Based on the hypothesis that semantically related multilingual terms representing similar concepts tend to co-occur with similar inter- and intra-document frequencies across a parallel corpus, fuzzy c-means can be applied to sort a set of multilingual terms into clusters (concepts) such that terms belonging to any one of the clusters (concepts) are as similar as possible, while terms of different clusters (concepts) are as dissimilar as possible, in terms of the concepts they represent.

To realize the idea of mining the multilingual concept–term relationships using fuzzy c-means, a fuzzy multilingual term clustering algorithm is developed. To begin with, a set of multilingual terms, which are the objects to be clustered, is first extracted from a parallel corpus of N parallel documents. Each term is then represented as an input vector of N features, where each of the N parallel documents is regarded as an input feature and each feature value represents the frequency of that term in the nth parallel document. Details of the fuzzy multilingual term clustering algorithm are presented as follows.

The fuzzy multilingual term clustering algorithm:

1. Initialize the membership values $\mu_{ik}$ of the $K$ multilingual terms $x_k$ to each of the $c$ concepts (clusters), for $i = 1, \ldots, c$ and $k = 1, \ldots, K$, randomly such that

$$\sum_{i=1}^{c} \mu_{ik} = 1, \quad k = 1, \ldots, K \qquad (1)$$

and

$$\mu_{ik} \in [0, 1], \quad i = 1, \ldots, c, \; k = 1, \ldots, K \qquad (2)$$

2. Calculate the concept prototypes (cluster centers) $v_i$ using these membership values $\mu_{ik}$:

$$v_i = \frac{\sum_{k=1}^{K} (\mu_{ik})^{m} x_k}{\sum_{k=1}^{K} (\mu_{ik})^{m}}, \quad i = 1, \ldots, c \qquad (3)$$

3. Calculate the new membership values $\mu_{ik}^{new}$ using these cluster centers $v_i$:

$$\mu_{ik}^{new} = \frac{1}{\sum_{j=1}^{c} \left( \dfrac{\lVert v_i - x_k \rVert}{\lVert v_j - x_k \rVert} \right)^{2/(m-1)}}, \quad i = 1, \ldots, c, \; k = 1, \ldots, K \qquad (4)$$

4. If $\lVert \mu^{new} - \mu \rVert > \varepsilon$, let $\mu = \mu^{new}$ and go to step 2. Otherwise, stop.

5. Concept labeling. As a result of clustering, every multilingual term is assigned to various concepts (clusters) with various membership values. To apply these clusters as a multilingual concept directory, concepts can be labeled with meaningful tags. This can be done manually using expert knowledge or by selecting the term assigned the highest membership in each cluster for every language involved. As a result, a fuzzy partition of the multilingual term space, acting as a multilingual linguistic knowledge base, is now available for mining the conceptual content of all multilingual text.

3.2. Mining the conceptual content of multilingual text

Aiming at discovering the conceptual content of both multilingual documents and queries, our second multilingual text mining task concerns the mapping of multilingual text to concepts. This process is considered a text categorization task.

Text categorization is conducted based on the cluster hypothesis [16], which states that documents with similar contents are relevant to the same concept. To accomplish the task, the crisp k-nearest neighbor algorithm [5] is among the most widely used methods [11,17]. It determines the membership of an unclassified text d to a concept c by examining whether the k pre-classified texts that are closest to d have also been classified to c.