Recognition of Brazilian Baiano and Gaucho Regional Dialects on Twitter Using Text Mining

The internet has broken geographical barriers and brought people and cultures closer, regardless of their physical location. However, language, idiom, dialects and accents continue to mark individuals by their origins. The object of study of this research is the Brazilian regional dialect, approached through linguistic corpora analyzed from a volume of data extracted from Twitter. This paper presents the results of the mining phase, the first stage of a project to create a technique for recognizing Brazilian Portuguese regional dialects. Analysis and conclusions were made only for the baiano and gaucho dialects, considering the significant size of these samples and the need to reach a diagnosis of the collected data set.


Introduction
At the time of Brazil's discovery, studies estimate that more than 1,000 languages were spoken in the country. In the early years of colonization, the indigenous languages were used even by the Portuguese colonists and, being spoken by almost all inhabitants of Brazil, became known as the common language; it disappeared almost entirely in the eighteenth century, when Portuguese became the official language. With the flight of the Portuguese Court to Brazil in 1808, the Portuguese linguistic norm, which drew on the dialects spoken between Coimbra and Lisbon, evolved into the dialect of Rio de Janeiro. Language is a system formed by rules and values assimilated and stored by the speakers of a particular language community, learned from experience shared with other speakers. It differs from the idiom, which identifies the language of one nation in relation to others and is related to the existence of a political state. A dialect is a linguistic variety with its own grammatical, phonological, morphological, syntactic, semantic and lexical rules, duly known, even implicitly, by its speakers; there is therefore no mechanism that determines its inferiority to a language. Dialects vary according to their geographical location and, in some conditions, according to the socioeconomic status of their speakers. Recognizing these linguistic variations when listening to a speaker is an everyday human action. Identifying them automatically with a text analysis tool, however, is a complex task that requires specific information retrieval, mining and classification techniques.
For this, a variant of data mining called text mining emerges, defined as the process of extracting interesting and non-trivial natural language patterns or knowledge from a set of textual documents (Tan 1999), making it possible to transform this volume of unstructured data into useful and often innovative knowledge. The dependence of modern life on digital technologies is creating opportunities for the study of human behavior and social trends based on what can be mined from networks. Studies at Northeastern University (Mocanu et al. 2013) have already investigated the dynamics of languages on Twitter, but their analysis does not evaluate or cross-check aspects such as the internet dialect, its origin and its geographical location, and in particular does not address the recognition of Brazilian regional dialects. For this work, in order to ensure assertive results, given that the collected base has a significant volume and the number of words to be analyzed is equally vast and distributed, only two dialects were chosen, baiano and gaucho (Leite and Callou 2010), intensifying the research on these groups, reducing the scope and ensuring an accurate outcome when the data are exposed to the method and tools used. The purpose of this article is to present the results obtained in the text mining process over a sample of Twitter posts from October 23rd and 24th, 2014, explaining the inferences found and the degrees of confidence and accuracy obtained from the application of knowledge discovery in structured data (KDD) and knowledge discovery in unstructured data (KDT). This is one of the stages of a project that proposes the creation of a technique for recognizing the baiano and gaucho dialects in any data collected from this social network. This document begins with a literature review and a brief approach to the concepts that underlie this proposal and the execution of the activities contemplated in the project.
The following section presents the methodology and tools used, followed by the development and results sections.

Literature Review
For the development of this work, some fundamental concepts are needed: language, idiom, dialect and accent; computational linguistics; information retrieval; knowledge discovery; and text mining.

Language, idiom, dialect and accent
Brazilian Portuguese, regulated by the Brazilian Academy of Letters (ABL), is the name given to the variety of the Portuguese language spoken by over 200 million people in Brazil, making it the most widely spoken, read and written variant of the language in the world. Throughout the country's history, Brazilian Portuguese has incorporated terms from Native American and African languages, French, Castilian, Italian, German and English, which together give rise to numerous regional variations. A language is a system whose structure is studied from a corpus, also considered as the set of necessary conventions, adopted by the social body, that allow the exercise of speech (Rabaça and Barbosa 1987; Rodrigues 2008). Thus, according to Travaglia (1997, 22): "[...] language is seen as a code, that is, as a set of signs that combine according to rules, and that is capable of transmitting a message, information from a sender to a receiver. This code must therefore be mastered by speakers for communication to be effective. As the use of the code that is the language is a social act, consequently involving two people, it is necessary that the code be used in a similar, pre-established, agreed-upon manner for communication to take place." An idiom is any form of expression particular to a people. It refers to the language that identifies one nation in relation to others, is related to the existence of a political state and is linked to the official language of a country. References can be found that treat language and idiom as equivalent terms (Almeida 1998). Dialect (from the Greek διάλεκτος, translit. diálektos: talk, conversation, discussion by questions and answers; way of speaking, a country's own language (Houaiss, Villar, and Franco 2001)) is a subset of an idiom, defined by the way a language is spoken and understood in a given geographical region and determined by its own phonological, syntactic, semantic and morphological characteristics.
For a dialect to be considered an autonomous language, linguistics holds that there must be mutual understanding between at least two communities and the existence of a common linguistic corpus, in other words, literary works considered the inheritance of both without the need for translation. Note that there is an explicit difference between dialect and accent. For Houaiss, Villar, and Franco (2001), the former is any regional variation of an idiom that does not compromise mutual intelligibility between the main-language speaker and the dialect speaker, while the accent is a concept of popular use that does not exist in scientific terms and usually designates only a change in the intonation of words or the imperfect pronunciation of some phonemes by a foreigner.

Computational linguistics
Linguistics, as the science of language, deals, among other aspects, with texts, speech and dialogue. Words, which form structured sentences, are embedded in a situation and have predictable, speaker-independent structures that can be formally described. Computational linguistics is the area of knowledge that explores the relationship between linguistics and informatics, dealing with the computational treatment of language and natural languages for various practical purposes. It embraces a set of activities aimed at enabling communication with machines using the natural skills of human communication (Vieira and Lima 2001). It is divided into two subareas: corpus linguistics and natural language processing (NLP). NLP is concerned with the study of language aimed at the construction of specific software, such as automatic translators, chatterbots, parsers and automatic speech recognizers, among others that interpret or generate information provided in natural language. Corpus linguistics, in turn, is the area of linguistics that deals with the collection and exploration of carefully collected corpora, for the purpose of researching a language or linguistic variety through computer-extracted empirical evidence (Sardinha 2000). According to Sanchez (1995, 8-9, cited by Vale and Tagnin 2008), a corpus is: "A set of linguistic data [...] systematized according to certain criteria, sufficiently extensive in breadth and depth to be representative of all linguistic use or any of its scopes, arranged in such a way that it can be processed by computer, in order to provide various and useful results for description and analysis".

Information retrieval
The predominant language on the Internet is still English, as it has been since the network was born in the 1960s from American research to build robust and fault-tolerant network communication (Leiner et al. 1997). It is estimated that today there are more than 5 billion words in Portuguese on the network (Sardinha and Almeida 2008). This content can be collected with information retrieval (IR) techniques, methods and models that ensure the highest quality of the information returned. IR is one of the areas of knowledge that make up and contribute to text mining. It is basically about extracting, from a volume of unstructured data typically stored on computers, the part that satisfies a specific need (Shiri 2004). With the widespread use of the internet, search engines are everyday examples of these tools, whether for web searches or for searches in email boxes. However, linguistic resources are still scarce in these tools: little morphological and syntactic analysis is observed, and semantic analysis is done only implicitly. Information retrieval is the area involved in obtaining relevant documents on a particular topic. It works with a set of techniques, such as indexing, searching, filtering, organizing and handling multiple languages, which serve the purpose of finding relevant data according to a specific search (Vieira and Lima 2001). IR systems have traditionally been based on keyword or similarity search, while the most complex ones work with lexicons, knowledge bases and ontology networks. Even so, they have limitations when they do not model relationships, dependencies, actions or events. Importantly, words are vague and ambiguous: a word can have several meanings, and the same object can be expressed through several words. Therefore, no study can disregard the use of elaborate semantic models that model the language in use, especially considering grammatical relations and elements.
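As an illustration of the keyword and similarity search mentioned above, the sketch below ranks a small set of posts against a query using TF-IDF weighting and cosine similarity. The tokenized example posts and the regional expressions in them are hypothetical, chosen only for illustration, and do not come from the collected corpus:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a sparse TF-IDF vector (dict) for each tokenized document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequency per term
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "oxente meu rei que massa".split(),
    "bah tche que tri legal".split(),
    "bom dia pessoal".split(),
]
query = "bah tche".split()

# Treat the query as one more document so its terms get IDF weights,
# then rank the original documents by similarity to it.
vecs = tfidf_vectors(docs + [query])
scores = [cosine(vecs[-1], v) for v in vecs[:-1]]
best = scores.index(max(scores))  # index of the most similar post
```

Here the second post, which shares the query terms, ranks first; real IR systems add indexing and normalization layers on top of this basic weighting scheme.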

Knowledge discovery
Knowledge discovery is an automated, computer-supported process for analyzing data or information from its collection and processing with the primary purpose of enabling the acquisition of new knowledge by manipulating large databases. However, the discovery process is strongly related to the way information is processed. In the automation of this process, there are two approaches: KDD characterized by data mining, and KDT, the basis of this project, based on text mining (Fayyad, Piatetsky-Shapiro, and Smyth 1996). KDT has well-defined steps. Constituting its base is the collection stage: the process of searching and retrieving texts in order to form the textual database from which some kind of knowledge is to be extracted. However, the first challenge is the location of data sources. There are three main environments: the file folders found on users' disks, the tables of the various databases and the internet (Aranha and Passos 2007). The latter, with all its heterogeneity, is the data source of the proposed project, with Twitter as the source of the volume of texts collected.

Text mining
Text Mining or Text Data Mining is considered an evolution in the information retrieval area (Mugridge 2011). Inspired by data mining, which extracts information from a structured database, text mining is a process that applies data analysis and extraction techniques to a processed volume of unstructured or semi-structured texts, using computational algorithms to find patterns and trends in a set of documents, sort them, and even compare them. In this way, it is possible to identify information useful for a specific purpose that could not be retrieved manually or by some traditional method of consultation, given the way it is stored.

Methods and Tools
For the development of this work, the following steps of the text mining process were contemplated: preprocessing, indexing, data mining algorithms and results analysis. Classification, also known as text categorization, was the mining technique chosen, with the purpose of predicting, in the collected data, to which class each post belongs: baiano, gaucho or neither (not applicable). The software R, RStudio and Weka were used as support tools, with the execution of the algorithms and command lines necessary to classify the extracted data volume.
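The classification itself was carried out in Weka; as a rough sketch of what such a classifier does internally, the following minimal multinomial Naive Bayes with Laplace smoothing predicts one of the classes for a tokenized post. The toy posts and regionalisms are hypothetical and do not reproduce the project's data or Weka's implementation:

```python
import math
from collections import Counter

class NaiveBayesText:
    """Minimal multinomial Naive Bayes text classifier with Laplace smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.priors = Counter(labels)                      # class counts
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, lab in zip(docs, labels):
            self.word_counts[lab].update(doc)              # per-class term counts
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        self.totals = {c: sum(self.word_counts[c].values()) for c in self.classes}
        self.n = len(labels)
        return self

    def predict(self, doc):
        best, best_score = None, -math.inf
        v = len(self.vocab)
        for c in self.classes:
            # log P(c) + sum of log P(w | c) with add-one smoothing
            score = math.log(self.priors[c] / self.n)
            for w in doc:
                score += math.log((self.word_counts[c][w] + 1) / (self.totals[c] + v))
            if score > best_score:
                best, best_score = c, score
        return best

docs = [["oxente", "massa"], ["oxente", "rei"], ["bah", "tche"], ["tche", "tri"]]
labels = ["baiano", "baiano", "gaucho", "gaucho"]
clf = NaiveBayesText().fit(docs, labels)
print(clf.predict(["bah", "tri"]))  # prints "gaucho"
```

The conditional-independence assumption is what makes the per-word log-probabilities simply add up, which is why the method remains usable on small samples such as the one studied here.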

Experimental Work
The evaluated database comprised 22,613 random posts collected between October 23 and 24, 2014 by an automated process that called the Twitter API. Although 40 attributes relevant to each tweet were extracted, only the one called TEXT was used in this work, exported through MySQL command lines into a single text file with all records of the collected posts. One of the outputs foreseen in this study was the construction of a word cloud, produced in the R software after term frequency counting in an online tool made available by the Insite Linguistics Group (n.d.), which researches and develops products related to natural language processing (NLP). For this, preprocessing of the database became necessary, such as converting all characters to lowercase, removing punctuation and numbers, and suppressing stopwords. However, it was observed that, in the format in which the database stood, the desired classification would not be possible from R, requiring further manipulation to perform the next steps in the Weka tool. Thus, the data volume consolidated after the first preprocessing was re-exported, this time considering 8 attributes: text, source, place, user_name, user_screen_name, user_location, user_description and user_lang, unlike the first export with only the text column. These attributes were chosen because new filtering was needed to reduce the number of records, given the hardware limitations encountered when processing the base of 13,081 records found. Records with user_lang = "pt" and a non-empty place attribute were then filtered, reaching a sample of 435 records. Although small in quantity, the study continued as proposed so that the classification could be completed and a result found as expected.
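The preprocessing steps above (lowercasing, removal of punctuation and numbers, stopword suppression) were performed in R; a minimal sketch of the same operations, with an abbreviated and purely illustrative Portuguese stopword list, could look like this:

```python
import re
import string

# Abbreviated, hypothetical Portuguese stopword list, for illustration only
STOPWORDS = {"a", "o", "de", "que", "e", "do", "da", "em", "um", "uma"}

def preprocess(text):
    """Lowercase, strip numbers and punctuation, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)                               # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return [tok for tok in text.split() if tok not in STOPWORDS]

print(preprocess("Bah, tche! O jogo de 2014 foi tri legal."))
# prints ['bah', 'tche', 'jogo', 'foi', 'tri', 'legal']
```

As the conclusion notes, on 140-character posts this cleanup can shrink a tweet to very few tokens, so the order and aggressiveness of these steps directly affect what survives for classification.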
The next step was the creation of a specific text file for each post, that is, 435 files, manually tagged with the defined classes: baiano, gaucho and none (Figure 1), a prerequisite for using the SCA-Classifier tool (Matos 2010), a user-friendly interface to Weka algorithms and features.

Results
The use of the SCA-Classifier tool allowed the execution of classification algorithms such as Naive Bayes and ID3, as well as the training of the obtained base, reaching the results shown in Figure 3 and Figure 4. The Naive Bayes classifier is among the most used in machine learning. It is called naive because it assumes that attributes are conditionally independent, in other words, that the information of one event is not informative about any other (Fayyad et al. 1998), and it does not require a large amount of data to reach reliable and meaningful results. In the proposed sample, approximately 92% of the instances were correctly classified, while only 8% were incorrectly classified. ID3, also applied to this data sample, uses decision trees and is among the most popular inductive inference algorithms. Its main choice is which test attribute will be used at each node of the tree. For this, a statistical property called information gain is defined, which measures how well a given attribute separates the training examples according to their classification; ID3 uses it to select, among the candidates, the attribute to be used at each step while building the tree. Although distributed differently, the percentage of correctly classified instances with this algorithm was similar to that of the Naive Bayes classification, as was its precision for the dialects in the same order.
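The information gain used by ID3 can be illustrated directly. The sketch below computes the reduction in label entropy obtained by splitting a toy set of labeled posts on a single hypothetical binary attribute (whether the post contains the token "bah"); the data are invented for the example:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    total = entropy(labels)
    n = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder

# Toy example: the attribute separates the two classes perfectly,
# so the gain equals the full initial entropy of 1 bit.
rows = [{"has_bah": 1}, {"has_bah": 1}, {"has_bah": 0}, {"has_bah": 0}]
labels = ["gaucho", "gaucho", "baiano", "baiano"]
print(information_gain(rows, labels, "has_bah"))  # prints 1.0
```

ID3 simply evaluates this quantity for every candidate attribute at each node and branches on the one with the highest gain.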

Conclusion
The information society is the main beneficiary of computational linguistics, as studies in speech, text and image processing are advancing to make the vast amount of information currently available on the world wide web more accessible. The use of corpora for the analysis and validation of data samples goes back centuries, but the use of computers to treat these samples is still incipient. Whether due to the lack of tools or the lack of professional technical knowledge in this field, the role of technology in such studies has become relevant only in recent years. The larger a corpus, the greater its representativeness. Thus, for the study of the lexicon, considering that some words are rarely used, as in the dialects presented here, the larger the corpus, the greater the possibility of relevant terms appearing in the sample. For the present work, an integral part of the Baiano and Gaucho Brazilian Dialect Recognition on Twitter project, the sample size and the format in which texts are posted on this social network were limiting factors in the analysis. With a maximum length of 140 characters and the removal of symbols, punctuation, stopwords and numerals, terms that are irrelevant to the proposed mining, a post is reduced to a size that can compromise the end result. Even so, the application of the KDD and KDT techniques showed that it is possible to achieve a reliable end result, although significant manual work is still required in preprocessing. The algorithms used were selected for their ability to work effectively even on small data volumes, and presented similar results in the classification performed, whether by probabilistic learning or by decision tree.
As future work, it is first suggested to automate the categorization process, which will allow the analysis and mining of a more significant data volume, along with an increase in the capacity of the hardware used, so as not to compromise the execution of the techniques and algorithms and to enable future real-time analysis of messages posted on Twitter or any other social network.