
Automatic Text Summarization (eBook)

eBook Download: EPUB
2014 | 1st edition
100 pages
Wiley-Iste (publisher)
978-1-119-04407-9 (ISBN)

This new textbook examines the motivations behind, and the main algorithms for, automatic document summarization (ADS), and surveys the recent state of the art. It presents the principal problems and difficulties of ADS and the solutions provided by the research community, covering statistical, linguistic and symbolic approaches, as well as current applications and trends. Several examples are included in order to clarify the theoretical concepts. The books currently available in the area are not recent, while powerful algorithms and new applications of ADS have been developed in recent years; in particular, the massive use of social networks and new forms of technology requires the adaptation of the classical text summarization methods. Based on teaching materials used in one- or two-semester courses, this textbook presents an extensive state of the art and describes the new systems in the field. Previous automatic summarization books have been either collections of specialized papers, or authored books with only a chapter or two devoted to the field as a whole.

Juan-Manuel Torres-Moreno is Associate Professor at the Université d'Avignon et des Pays de Vaucluse (UAPV) in France and head of the Natural Language Processing research team (NLP/TALNE) at the Laboratoire Informatique d'Avignon (LIA). His current research lies within the field of NLP, where he investigates techniques for automatic text summarization (ATS). His other research interests include sentence compression, information retrieval, machine learning and artificial consciousness.

2. Automatic Text Summarization: Some Important Concepts


The aim of an automatic text summarization system is to generate a condensed and relevant representation of the source documents. This representation, which is in fact a compression involving a loss of information, must preserve the main points of the original document. However, there are countless heterogeneous sources of information capable of being summarized: videos, images, sound and text documents. This work will focus on text document summarization. A summary can be produced from one or several documents. When producing a generic summary, each topic is given the same level of importance during processing; when generating a guided summary, only the information desired by the user is processed. Finally, summarization tasks can select and analyze new information from a set of events over time, and can even be multilingual. This chapter will provide an overview of automatic text summarization: preprocessing, the types of summary and their uses, and generation algorithms. An introduction to the problems related to the evaluation of summaries will also be provided.

2.1. Processes before the process


Automatic text summarization is a complex process that must be broken down into modules if it is ever to be mastered. One such module is text preprocessing. Preprocessing enables a sequence of bits, i.e. a plain text document, to be transformed into an object with minimal linguistic features, such as words and sentences. Preprocessing has two objectives: word normalization and lexicon reduction in a text.

Both of these functions are extremely important for the vectorization of documents (section 2.1.1), because the space of representation is considerably reduced and becomes easier to manage. Therefore, preprocessing is essential, as it provides summarization systems with a clean and adequate representation of the source documents.

In simplified terms, (almost obligatory) preprocessing often involves the following stages:

– splitting the text into segments: sentences, paragraphs, etc.;
– splitting segments into words (tokenization);
– word normalization (lemmatization, stemming, etc.);
– filtering stopwords, etc.
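The four stages above can be sketched in a few lines of Python. This is a minimal, hypothetical pipeline: the tiny stopword list, the regex-based sentence splitter and the crude suffix-stripping normalizer are toy stand-ins for the real, language-dependent components discussed below.

```python
import re

# Toy English stopword list; real systems use much fuller lists.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def split_sentences(text):
    # Naive split after sentence-final punctuation; real splitters must
    # handle abbreviations, quotations and language-specific conventions.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Split a segment into lowercase word tokens, dropping punctuation.
    return re.findall(r"[a-z]+", sentence.lower())

def normalize(token):
    # Crude suffix stripping standing in for lemmatization/stemming.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Full pipeline: segment, tokenize, filter stopwords, normalize.
    return [[normalize(t) for t in tokenize(s) if t not in STOPWORDS]
            for s in split_sentences(text)]

print(preprocess("The cats were sleeping. A dog barked in the garden!"))
# → [['cat', 'were', 'sleep'], ['dog', 'bark', 'garden']]
```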

Optional preprocessing can include:

– annotation via grammatical tagging, or part-of-speech (POS) tagging;
– named entity (NE) recognition;
– extraction of terms and keywords;
– term weighting (with the binary weight, the number of occurrences or other schemes depending on the representation; see section 2.1.1 and Appendix A.2.1), etc.

Figure 2.1 shows a diagram of standard preprocessing that can be applied to source documents before the summarization process.

Figure 2.1. Standard text preprocessing

Preprocessing is a difficult task that depends to a large extent on the language in which the text is written. Sentence boundaries, for example, are demarcated by punctuation, context and quotations, the use of which varies considerably from language to language. Moreover, not all languages separate words with spaces. In fact, there are considerable differences between splitting a text written in a European language in Latin characters and one written in an Eastern language such as Chinese, Japanese or Korean. This problem is also encountered in other stages such as the normalization of words through lemmatization or stemming, POS tagging or NE recognition (see Appendices A.1.5 and A.1.4). Even eliminating stopwords (articles, conjunctions, prepositions, etc.) is not a simple task.

When the input documents are written in an unknown language, it is useful to use an automatic language identification module. The TreeTagger tool [SCH 94] has become very popular for POS tagging: it is available for several languages and is based on machine learning (ML) with decision trees. Word counting is also often part of preprocessing: word tokens (the occurrences of a word) and word types (unique words, as dictionary entries) are essential for statistical methods, particularly for calculating the weight of words.
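The token/type distinction can be illustrated with Python's standard library (the example phrase is an assumption, chosen only because it repeats words):

```python
from collections import Counter

text = "to be or not to be"
tokens = text.split()      # word tokens: every occurrence counts
types_ = set(tokens)       # word types: unique dictionary entries
counts = Counter(tokens)   # occurrences per type

print(len(tokens))   # → 6 tokens
print(len(types_))   # → 4 types: to, be, or, not
print(counts["to"])  # → 2
```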

Preprocessing is also used in other natural language processing tasks, such as the thematic classification of documents. Automatic summarization systems give poor results when the documents to be summarized have not been correctly preprocessed. More information about preprocessing is available in Appendix 1 and in [PAL 10], which contains an excellent chapter about preprocessing.

2.1.1. Sentence-term matrix: the vector space model (VSM)


Once preprocessing is complete and the summarization method has been chosen, a representation of the documents is required. The VSM [SAL 68, SAL 83, SAL 89] is widely used in information retrieval (IR) to represent documents and terms in an adequate space (see Appendix A.2). Several automatic summarization algorithms use this model. In this presentation, we are going to adapt this IR-oriented model, defined with N terms and D documents, to a single document with N terms and ρ sentences. The transposition remains valid.

In this model, word order is not important. It is known as the bag-of-words model. Each word (term) is given a weight ω, which measures its importance in the document. There are three types of weighting (see Appendix A.2):

Binary: ω_{μ,j} equals 1 if the term j is in the sentence μ, and 0 otherwise;

Frequency: the number of occurrences of the term j in the sentence μ (term frequency, tf): ω_{μ,j} = tf_{μ,j};

Corrective: using a frequency correction function that takes into account the distribution of the word across sentences (inverse document frequency, idf):

idf_j = log(ρ / D(j))

where idf_j measures the importance of the term j in the set of ρ sentences and D(j) is the number of sentences of the document in which the word j appears.
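The three weighting schemes can be sketched as follows. The toy sentences are assumed example data, and the unsmoothed natural-log form of the idf is one common convention; real systems vary the logarithm base and add smoothing.

```python
import math

# Toy preprocessed sentences as token lists (assumed example data).
sentences = [["cat", "sleep"], ["dog", "bark", "cat"], ["dog", "sleep"]]
rho = len(sentences)  # number of sentences

def weight(mu, term, scheme="tf"):
    tf = sentences[mu].count(term)  # occurrences of term in sentence mu
    if scheme == "binary":
        return 1 if tf > 0 else 0
    if scheme == "tf":
        return tf
    # tf.idf: D(j) is the number of sentences in which the term appears
    D_j = sum(1 for s in sentences if term in s)
    return tf * math.log(rho / D_j)

print(weight(1, "cat", "binary"))  # → 1
print(weight(1, "cat", "tf"))      # → 1
print(weight(0, "cat", "tfidf"))   # 1 * log(3/2)
```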

Thus, a matrix S[ρ×N] of ρ rows (sentences) and N columns (terms in the lexicon) is generated. Vectorization transforms a document into a set of ρ sentence vectors s⃗_μ. Each sentence is represented by a vector s⃗_μ = (ω_{μ,1}, ω_{μ,2}, …, ω_{μ,N}). A lexicon (reduced through simplification during preprocessing) of N words produces a sentence-term matrix S = [s_{μ,j}]; μ = 1, …, ρ; j = 1, …, N:

        ⎛ ω_{1,1}  ω_{1,2}  …  ω_{1,N} ⎞
    S = ⎜ ω_{2,1}  ω_{2,2}  …  ω_{2,N} ⎟     [2.1]
        ⎜    ⋮        ⋮     ⋱     ⋮    ⎟
        ⎝ ω_{ρ,1}  ω_{ρ,2}  …  ω_{ρ,N} ⎠

where each row μ contains the weighting ω_{μ,j} of the word j in the sentence s⃗_μ. The matrix S[ρ×N] is a representation, in an exploitable format, of the important information contained in the source document. Graph-based models, where ρ nodes represent the sentences and an edge α_{k,μ} carries a similarity value (cosine, Jaccard, Dice, etc.) between sentences s_k and s_μ, are also used a great deal in automatic summarization (see section 3.5).
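As a sketch of this vectorization step, the following builds a small sentence-term matrix S with tf weights and computes the cosine similarities between sentence vectors, i.e. the values a graph-based model would place on its edges. The toy sentences are assumptions, not data from the book.

```python
import math

# Toy preprocessed sentences (assumed example data).
sentences = [["cat", "sleep"], ["dog", "bark"], ["cat", "dog", "sleep"]]
lexicon = sorted({t for s in sentences for t in s})  # the N terms

# Sentence-term matrix S[rho x N] with tf weights.
S = [[s.count(term) for term in lexicon] for s in sentences]

def cosine(u, v):
    # Cosine similarity between two sentence vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Similarity values alpha_{k,mu} for every pair of sentences.
edges = {(k, mu): cosine(S[k], S[mu])
         for k in range(len(S)) for mu in range(k + 1, len(S))}
print(edges)
```

Sentences sharing no terms (the first two) get similarity 0, while each of them overlaps with the third, which contains all their content words.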

2.2. Extraction, abstraction or compression?


[MAN 01] defines a summary as a document containing several text units (words, terms, sentences or paragraphs) that are not present in the source document. However, this definition is too restrictive: during sentence extraction, the selection process may apply only to fragments of sentences, so the resulting extract will be composed of units that are not presented in exactly the same way as in the original document [BOU 08a]. An intermediate type of text, the condensed text, exemplifies the process used by the majority of summarization systems: important sentences are identified, extracted, and subsequently assembled and reformulated [MAN 01]. In English, there is no ambiguity between the meanings of summary, abstract and extract:

– a summary is a reduced representation that keeps the essential content of a text;
– an abstract is a summary produced by reformulating sentences;
– an extract is a summary produced by extracting sentences from the source text.
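As an illustration of the extract category, here is a deliberately naive extractive summarizer (a sketch, not an algorithm from the book): it scores each preprocessed sentence by the corpus frequency of its terms and keeps the top k sentences in their original order.

```python
from collections import Counter

def extract_summary(sentences, k=1):
    # Score each sentence by the total corpus frequency of its terms,
    # then return the k best sentences in their original document order.
    freq = Counter(t for s in sentences for t in s)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sum(freq[t] for t in sentences[i]),
                    reverse=True)
    keep = sorted(ranked[:k])  # restore source order
    return [sentences[i] for i in keep]

# Toy preprocessed document (assumed example data).
doc = [["cat", "sleep"], ["cat", "dog", "sleep"], ["bird"]]
print(extract_summary(doc, k=1))  # → [['cat', 'dog', 'sleep']]
```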

However, in languages such as French or Spanish, misuse of the word "résumé" (or "resumen" in Spanish) has led to the meanings of summary and abstract being conflated, resulting in ambiguity.

A carefully selected title of a document can be considered as the maximum summary of a text. In fact, the task of automatically generating titles is related to automatic summarization. It consists of producing the title of a document or a paragraph from the content of the document or paragraph in question. The title must be short, contain the important information from the source and, most importantly, be concise. However, the task of automatically generating adequate titles is very complex. For more information on the topic, see the works of [GÖK 95, BAN 00, JIN 03].

When discussing automatic text summarization, we very often talk about a text corpus. A corpus is a large collection (sample) of documents in written or spoken language. A corpus provides a platform for analyzing language, verifying linguistic theories and applying machine learning (ML) to natural language processing (NLP) algorithms for specific tasks. As this work is only concerned with written documents, we will not be looking at speech recognition, transcription and speech summarization tasks. Written corpora are...

Publication date (per publisher): 25.9.2014
Language: English
Subject area: Computer science • Theory / Studies • Artificial intelligence / Robotics
Keywords: Computer Science • Informatik • Informationstechnologie • Information Technologies
ISBN-10: 1-119-04407-3 / 1119044073
ISBN-13: 978-1-119-04407-9 / 9781119044079