Proceedings of the Conference on Language & Technology 2009

Hindi to Urdu Conversion: Beyond Simple Transliteration

Bushra Jawaid, Tafseer Ahmed
University of Malta, Malta; Universitaet Konstanz, Germany
bushrajd84@hotmail.com, tafseer@gmail.com
                                                                              
                    
Abstract

This paper presents a detailed analysis of existing work on Hindi to Urdu transliteration systems and identifies the enhancements they require. It lists the issues that are beyond the scope of character-by-character mapping. These issues include multiple same-sound Urdu characters that correspond to one Hindi character. Moreover, the paper deals with cases in which the same word or words are written in two different ways. It lists the differences in pronunciation, spelling and writing style, and presents solutions to these issues that go beyond transliteration.

1. Introduction

Urdu and Hindi are considered different styles of the same language. They share a grammar and differ in vocabulary and writing script. Urdu uses more Arabic and Persian words and is written in the Nastalique script, whereas Hindi uses more Sanskrit words and is written in the Devnagri script.

In conversation, Urdu and Hindi are mutually intelligible. Television programs and films are watched in both language communities without the need for translation. A Pakistani Urdu speaker understands Indian Hindi films, and an Indian Hindi speaker understands Urdu programs. The problem arises when a person tries to read written text of the other language: most people cannot read the script of the other language.

A considerable amount of work has been done on Hindi to Urdu transliteration. CRULP [1] and Malik [2] have discussed the issues of Hindi to Urdu transliteration and implemented transliteration systems. This paper has two fundamental goals. The first is to find the problems and shortcomings in the models and implementations of [1] and [2], and to propose solutions to these problems. The second is to find out whether an accurate character-by-character Hindi to Urdu transliteration is enough for the Urdu reader to read transliterated Hindi text comfortably, or whether we need to deal with issues that are beyond transliteration.

2. Hindi-Urdu transliteration: a brief review

It has already been mentioned that the two languages use different scripts for writing. Here we discuss these scripts briefly.

Hindi is written in the Devnagri script, which is read and written from left to right. All consonants in Hindi carry an inherent [] sound. All the vowels in Hindi are attached to the top or bottom of the consonant or to an [] vowel sign attached to the right of the consonant, with the exception of the [] vowel sign, which is attached on the left [5]. Hindi has 29 non-aspirated and 15 aspirated consonants, and 11 vowels (svara) [2]. A syllable (akshara) is formed by the combination of zero or one consonants and one vowel [5].

The Nastalique script is read and written from right to left. Nastalique, a cursive, context-sensitive and highly complex writing system, is widely used for Urdu orthography. The shape assumed by a character in a word is context sensitive. The Urdu alphabet contains 35 simple consonants, 15 aspirated consonants, one character for the nasal sound, 15 diacritical marks, 10 digits and other symbols [2].

Below is the consonant chart for Hindi, with the corresponding Urdu character for each consonant.

Table 1: Mapping of Hindi and Urdu consonants

Devnagri letter    Devnagri name    Urdu consonant (name)
क                  KA               Kaaf
ख                  KHA              Kaaf-Hay
ग                  GA               Gaaf
घ                  GHA              Gaaf-Hay
ङ                  NGA              Noon
च                  CA               Chay
छ                  CHA              Chay-Hay
ज                  JA               Jeem
झ                  JHA              Jeem-Hay
ञ                  NYA              […]
ट                  TTA              Ttay
ठ                  TTHA             Ttay-Hay
ड                  DDA              Ddaal
ढ                  DDHA             Ddaal-Hay
ण                  NNA              […]
त                  TA               Tay / Toay
थ                  THA              Tay-Hay
द                  DA               Daal
ध                  DHA              Daal-Hay
न                  NA               Noon
प                  PA               Pay
फ                  PHA              Pay-Hay
ब                  BA               Bay
भ                  BHA              Bay-Hay
म                  MA               Meem
य                  YA               Bari-Yeh
र                  RA               Ray
ल                  LA               Laam
व                  VA               Wow
श                  SHA              Sheen
ष                  SSA              Sheen
स                  SA               Seen / Saay / Suad
ह                  HA               Hay

Below is the vowel chart for Hindi and Urdu.

Table 2: Mapping of Hindi and Urdu vowels

Hindi letter    Hindi diacritical mark    Urdu letter                    Vowel
अ               -                         […]                            […]
आ               ा                         […]                            […]
इ               ि                         ِ                              […]
ई               ी                         […]                            i
उ               ु                         ُ                              […]
ऊ               ू                         […]                            u
ऋ               ृ                         ر + ِ (consonant + vowel)      […]
ए               े                         […]                            e
ऐ               ै                         […]                            æ
ओ               ो                         […]                            o
औ               ौ                         […]                            […]

In Table 2 we have listed the Urdu vowel symbol, or group of symbols, that represents each Hindi vowel sound. The only exception is the vowel “ऋ”, whose sound maps onto an Urdu consonant plus a vowel character. Current transliteration systems do not provide support for the independent form of this vowel. CRULP’s output for the sample word “[…]” (rishi) is “[…]شِ”.

Here we list a few sample words that give the reader an idea of the difference in writing style between the two languages.

[…]

3. Issues in Hindi-Urdu conversion

The paper discusses the issues in Hindi to Urdu conversion that remain unsolved in CRULP’s and Malik’s systems. To identify these problems we made a small survey of the Hindi text available at [3] and [4]. We transliterated the Hindi text to Urdu using CRULP’s Hindi to Urdu transliterator and listed the problems identified in the converted text. We then examined Malik’s solution to find out whether his algorithm and structures solve these problems. It was found that most of the problems are not solved by his model either.

The identified issues are of three types. The first type consists of unsolved problems in character-by-character transliteration of Hindi text into Urdu script. The remaining issues are beyond the scope of character-to-character transliteration: they are differences in writing style and in vocabulary. We present a list of all these issues in the following text. Most of the Hindi presented in the following discussion is taken from [3] and [4], and a few example sentences are constructed. In section 4, we will present solutions for these issues.

3.1. Transliteration between different scripts

After transliterating the Hindi text with CRULP’s transliterator and comparing the results with the expected output of Malik’s system, we found the following issues that remain unsolved in one or both of the systems.

3.1.1. Same/similar sound character. In the following example sentence, the word “[…]” is not transliterated correctly. In the problem word, the “t” sound character should be transliterated into “[…]” instead of “[…]”.

(1) […]

The similar sound character problem always occurs because multiple Urdu characters correspond to one Hindi character, as can be seen in Table 1. For the same reason, the wrong character is often selected for words that end in “[…]”.

(2) […]

In (2), for example, the word “[…]” is written with “[…]” instead of “[…]”. Table 4 gives the list of same sound Urdu characters.

Table 4: List of same sound Urdu characters

Sounds       Urdu characters for each sound    Default character for transliteration
t sound      Tay, Toay                         […]
s sound      Seen, Saay, Suad                  […]
z sound      Zay, Zaal, Zuad, Zoay             […]
a sound      […]                               […]
a sound      […], […]-at-end                   […]

Systems currently built for Hindi to Urdu transliteration have fixed transliteration rules for same sound character mapping. Those rules map Hindi’s same sound characters onto the default Urdu characters defined in Table 4.

3.1.2. Characters similar in shape. Occasional transliteration errors are primarily due to characters that are identical in shape in the Devnagri script and differ only by the addition of a dot. These errors are rarely caused by missing dots; they are mostly due to pronunciation differences between the speakers of the two languages.

(3) […]

Table 5: Chart of similar shape Hindi characters

Urdu         Hindi        Transliteration errors
[…]          […]          […]

3.1.3. Nasalized sound character. In the Urdu consonant chart we have a single character to represent the nasalized sound, known as “noon-ghunna” (ں).

The Hindi script has the “chandrabindu” (ँ) and “bindu” (ं) diacritics to represent nasalized sounds. There are two problems in mapping chandrabindu/bindu to Urdu script. The first problem arises when this nasalized sound character occurs in the middle of the word, as shown in (4):

(4) […]

In Urdu, when noon-ghunna comes in the middle of a word it is replaced by noon. Current transliteration systems map Hindi nasalized characters to the Urdu noon-ghunna irrespective of their position in the word.

The second issue is that a few words in Urdu contain the character “[…]” but are pronounced with a nasalized sound.

(5) […]

Hindi speakers write these words the way they pronounce them. That is why the transliteration result contains “noon-ghunna” instead of “[…]”, as in (5).

3.1.4. Kasr-e-Izafat issue. Kasr-e-Izafat is represented by the Zer diacritic at the end of a word and is used to connect two words to form a compound word, e.g. “[…]”. Words having the izafat symbol produce an [e] sound during pronunciation. In the Devnagri script there is no concept of izafat to produce the [e] sound. That is why native Indian speakers write these words with the diacritical mark “े”, whose independent form is “ए”, in place of the diacritical mark “ि”, whose independent form is “इ”.

(6) […]

Thus, the wrong diacritical marking in (6) produces “[…]” instead of the izafat sign (Zer). Neither of the two systems provides a solution to this problem.

3.2. Different writing style

Even if character-by-character mapping is modeled successfully, a few differences remain in the writing conventions of Urdu and Hindi. These problems are beyond the scope of transliteration, and hence are not discussed in the two earlier works [1] and [2], but they should be addressed because the Urdu reader expects to read text that follows Urdu conventions.

3.2.1. Native words. There is a difference in the writing conventions of native Indic words in Hindi and Urdu. The problem has been found in words that end in a vowel sound. Hindi language can have words that
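The character-by-character mapping reviewed in Section 2, together with the fixed default rule described in Section 3.1.1, can be illustrated with a small sketch. It is not the CRULP or Malik implementation. The dictionary is hypothetical and covers only five consonants, the Unicode code points were chosen from the letter names in Table 1, and treating the first listed candidate as the default is an assumption made for this example.

```python
# Hypothetical, partial table: Devanagari consonant -> candidate Urdu letters.
# Code points are picked from the letter names in Table 1. Treating the first
# candidate as the default is an assumption of this sketch.
CANDIDATES = {
    "\u0915": ["\u06A9"],                      # KA -> Kaaf
    "\u0917": ["\u06AF"],                      # GA -> Gaaf
    "\u092E": ["\u0645"],                      # MA -> Meem
    "\u0924": ["\u062A", "\u0637"],            # TA -> Tay / Toay
    "\u0938": ["\u0633", "\u062B", "\u0635"],  # SA -> Seen / Saay / Suad
}


def transliterate_default(text):
    """Replace each known consonant by its default Urdu letter.

    Characters missing from the table are copied unchanged, so this is
    only a consonant-level illustration of a fixed mapping rule.
    """
    return "".join(CANDIDATES.get(ch, [ch])[0] for ch in text)


# TA followed by SA always becomes Tay followed by Seen, even in words
# whose correct Urdu spelling needs Toay, Saay or Suad.
result = transliterate_default("\u0924\u0938")
```

Because the rule never looks beyond the single character, every word that needs a non-default letter comes out misspelled, which is the behaviour Section 3.1.1 reports.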
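To make the same sound problem of Section 3.1.1 concrete, the sketch below enumerates every Urdu spelling that a character-level mapping allows for one Hindi consonant sequence. The table and function name are hypothetical. Only the t and s sounds are included, with code points chosen from the names in Table 1. The sketch shows how quickly the ambiguity grows and does not attempt to pick the correct spelling.

```python
from itertools import product

# Hypothetical table of same sound candidates (t and s sounds only).
SAME_SOUND = {
    "\u0924": ["\u062A", "\u0637"],            # TA -> Tay, Toay
    "\u0938": ["\u0633", "\u062B", "\u0635"],  # SA -> Seen, Saay, Suad
}


def candidate_spellings(word):
    """Return every spelling obtained by choosing one candidate per character.

    Characters without an entry in SAME_SOUND are kept as they are.
    """
    options = [SAME_SOUND.get(ch, [ch]) for ch in word]
    return ["".join(combo) for combo in product(*options)]


# One TA and one SA already give 2 * 3 = 6 possible spellings.
spellings = candidate_spellings("\u0924\u0938")
```

A fixed default rule always returns the first of these candidates, so selecting any other one requires information that the character mapping alone does not contain.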

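Section 3.1.3 states that noon-ghunna in the middle of a word is replaced by noon. A minimal sketch of such a position-sensitive step follows. The function name is hypothetical, and the rule is simplified: a noon-ghunna counts as word-medial when it is immediately followed by a letter or digit, so a noon-ghunna followed by a diacritic is left untouched.

```python
import re

NOON_GHUNNA = "\u06BA"  # ARABIC LETTER NOON GHUNNA
NOON = "\u0646"         # ARABIC LETTER NOON

# A noon-ghunna directly followed by a word character is not word-final.
_MEDIAL_NOON_GHUNNA = re.compile(NOON_GHUNNA + r"(?=\w)")


def fix_medial_noon_ghunna(urdu_text):
    """Replace noon-ghunna by noon unless it ends a word."""
    return _MEDIAL_NOON_GHUNNA.sub(NOON, urdu_text)


# Word-final noon-ghunna is kept, word-medial noon-ghunna becomes noon.
final_case = fix_medial_noon_ghunna("\u0645\u0627\u06BA")
medial_case = fix_medial_noon_ghunna("\u0633\u0627\u06BA\u067E")
```

Applied after a character-by-character pass that maps every chandrabindu or bindu to noon-ghunna, a step of this kind would address the first problem described in Section 3.1.3. It does not address the second one, where Urdu spelling and pronunciation differ.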