Prediction of highly lucrative companies using annual statements: A Data Mining based approach

Weinblat, Jurij

Business economics - Investment and Finance

Summary

The intention of this study is to predict one year in advance whether a regarded firm will grow extraordinarily in the next year. This is crucial for private investors and fund managers who need to decide whether they should invest in a certain firm. Companies like Apple and Amazon have shown that people who recognized the potential of such companies at the right time earned a lot of money.

The applied prediction models can also be used by politicians to identify companies which are eligible for funding, because growing companies oftentimes hire many employees.

Since annual reports are often publicly available for free, it is reasonable to take advantage of them for such a prediction. The prediction models are based on classification trees and forests because they have some very substantial advantages over other methods like neural networks, which are frequently used in the literature. For instance, they do not have distributional assumptions, accept both quantitative and qualitative inputs, and are not sensitive to outliers. Furthermore, they are easy for humans to understand and can deal with missing values, which is crucial for practical applications.

Excerpt

List of tables

Table 1: Important parameters of the class variable

Table 2: Selected qualitative key figures

Table 3: Selected absolute key figures

Table 4: Selected relative key figures

Table 5: Four entries of the example dataset

Table 6: Cross validated classification tree results for 2010

Table 7: rpart's prediction of lucrativeness for 2011

Table 8: Results for the reduced dataset

Table 9: Cross validated classification forest results for 2010

Table 10: Random forest's prediction of lucrativeness for 2011

Table 11: Results of the measures to improve the precision of the forests

List of abbreviations

BvD ... Bureau van Dijk Electronic Publishing GmbH

CART ... Classification and Regression Trees

csv-file ... Comma-separated-values-file

CV ... Cross validation

DM ... Data Mining

FN ... False negative

FP ... False positive

IQR ... Interquartile range

NA ... Not available

RF ... Random forest

RM ... Reference model

ROE ... Return on equity

SQL ... Structured Query Language

TN ... True negative

TP ... True positive

Acknowledgment

I would like to thank my supervisor Prof. Dr. Andreas Behr for assisting

me with this book. He provided me with valuable suggestions and gave me the

opportunity to write about my favourite topic. Thank you very much!

I would also like to say thank you to my parents and my girlfriend Sarah for

their support during my entire work.

1. Introduction and problem description

In the literature, many scientists describe how to use annual report data to predict whether a certain company is going to become bankrupt (Dimitras, Zanakis and Zopounidis 1996, 487-513). The reason why this topic attracts such a high degree of scientific attention is rather obvious: the stability of the financial system depends on the ability of banks and other financial service providers to assess whether a certain firm will be able to repay a loan or not. Furthermore, banks need this information to calculate an adequate probability of default and thereby to identify a minimum interest rate for a concrete loan (Moro and Schäfer 2004).

Nevertheless, it is not only relevant to anticipate this worst case of bankruptcy, but also whether a regarded small firm will grow extraordinarily in the next year and maybe even become a big company in the medium term. This is crucial information for private investors and fund managers who need to decide whether they should invest in a certain firm. Companies like Apple and Amazon have shown in the past that people who recognized the potential of such companies and bought their shares have earned a lot of money.

The prediction models, which are described in this paper, can also be used

by politicians to identify companies which are eligible for funding. Because

growing companies oftentimes hire many employees, it might be meaningful to

facilitate their development process by selective subsidies to reduce unem-

ployment. Furthermore, it is possible to question the prediction results of a fi-

nancial analyst if he came to a different conclusion than a model.

Since annual reports are often publicly available for free, it is reasonable to take advantage of them for such a prediction (Gräfer 1988, 52). Additionally, various information providers maintain huge databases with annual reports. A big data approach promises to further improve the accuracy of predictions (Rauscher and Rockel 2001, 5). This paper introduces methods which make it possible to generate knowledge out of these huge data sources in order to identify extraordinarily lucrative firms.

To generate these prediction models, a data mining approach is used which is based on the proven CRISP-DM process model for data mining projects. CRISP-DM ensures comparability and the consideration of best practices (Chapman, et al. 2000, 1-2). The prediction models are based on classification trees and forests because they have some very substantial advantages over other methods like neural networks, which are frequently used in the literature. For instance, the underlying algorithms of the used models do not require a certain distributional assumption, accept both quantitative and qualitative inputs, and are not sensitive to outliers. The two most important advantages, however, are interpretability and the handling of missing data. A tree can be easily interpreted by users, which is important for the previously described stakeholders because it is not easy to trust the results of a model which one does not understand (Löbbe 2001, 199). A lack of understanding might therefore impede the practical implementation of such a model. Besides that, the used algorithms can handle missing data, which occur very often in the available dataset. In other analyses, data entries are removed even if only one value is missing. This reduces the often already relatively small amount of available data and can reduce the model's accuracy (Neeb 2011, 67, Franken 2007, 5). This is not the case for the applied methods.

1.1 Intention of this study

The intention of this paper is to determine whether a stakeholder can use a

classification tree or classification forest at the beginning of one year to identi-

fy German firms which will grow exceptionally in this year using annual re-

ports' key figures from previous years. As a first step, key figures from the

years 2007, 2008 and 2009 are used to generate different trees and forests

which can predict whether a company grows outstandingly in 2010 or not.

These models require the lucrativeness information from 2010 to be generated.

To evaluate how well these unchanged models would work for the mentioned

stakeholder at the beginning of the year 2011, they are also applied to data

from 2008, 2009 and 2010 as a second step. This means that this time, the

models are applied to more recent data to anticipate whether the regarded firms

will grow intensively in 2011. Data from 2011 is only used to check the predic-

tions' correctness and not to generate models. The best identified models are

also compared and analysed.
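
The two-step setup described above can be illustrated with a minimal sketch. It only shows how key figures of three consecutive years are paired with the class of the following year. The function name `build_rows`, the toy firms, and all values are assumptions made for illustration and are not taken from the Amadeus data or from the models of this study.

```python
def build_rows(key_figures, lucrative, feature_years, target_year):
    """Pair each firm's key figures from feature_years with its class in target_year."""
    rows = []
    for firm, by_year in sorted(key_figures.items()):
        label = lucrative.get((firm, target_year))
        if label is None:
            continue  # firms without a class value for the target year are skipped
        # None marks a missing value; the entry is kept instead of being removed
        features = [by_year.get(year) for year in feature_years]
        rows.append((firm, features, label))
    return rows


# Hypothetical toy data: one key figure per firm and year.
key_figures = {
    "A": {2007: 1.0, 2008: 1.2, 2009: 1.5, 2010: 1.9},
    "B": {2008: 0.4, 2009: 0.5, 2010: 0.3},
}
lucrative = {("A", 2010): True, ("B", 2010): False, ("A", 2011): True}

# Step 1: figures from 2007-2009 and the 2010 class are used to generate models.
training_rows = build_rows(key_figures, lucrative, (2007, 2008, 2009), 2010)

# Step 2: the unchanged models receive figures from 2008-2010; the 2011 class
# serves only to check the predictions, not to generate models.
evaluation_rows = build_rows(key_figures, lucrative, (2008, 2009, 2010), 2011)
```

Shifting the year window by one while keeping the model fixed mirrors the situation of a stakeholder at the beginning of 2011, who cannot yet know the outcome of that year.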

These four particular years have been chosen because the available dataset "Amadeus" only contains a relatively small amount of more recent data. It is probably not necessary to regard more than three years for the generation of these models because it is shown in the literature that older data does not noticeably improve the prediction (Pytlik 1994, 94).

One important characteristic of this paper is the usage of the CRISP-DM model, which is frequently used by data analysts and helps them to understand the analysis more easily. Furthermore, this model encompasses important best practices which could otherwise be overlooked.

Another distinctive feature of this analysis is that the used dataset has a total size of about 18 gigabytes. It is so huge because it contains more than 24 million entries, each of which contains up to 158 attributes, of over 3 million different European companies. Such dimensions are untypical for such analyses and overburden frequently used software like R. Kumar and Ravi have shown in their review that the majority of bankruptcy prediction studies do not analyse more than 9000 firms simultaneously (2007, 6-7). Furthermore, this analysis overstrains most current desktop computers because they do not have enough main memory for it. Besides that, this dataset does not have a typical database structure. This is why this analysis meets two important criteria (size and complicated structure) of Big Data analysis (IBM Corporation Software Group).

1.2 Proceeding

To reach all these goals, the following procedure is chosen. Since this Data Mining analysis is based on key figures from annual reports, chapter 2 describes some main principles of such an analysis, why it is more powerful than other analyses using older techniques, and which drawbacks the usage of annual statements has in general. This chapter provides some explanations why the generated models sometimes fail to make a correct prediction.

Chapter 3 presents the used dataset and which requirements its entries have to meet in order to be analysed in this book. It is, moreover, explained in the appendix how to solve some of its structural problems.

After the available data is described, chapter 4 presents and elucidates all the qualitative and quantitative key figures which are used to predict growth. To be meaningful, these key figures have to meet certain demands, which are presented as well. Furthermore, it depicts how it is determined whether a company is lucrative and has, therefore, grown outstandingly or not, because there are several different possibilities to do that. Based on these key figures, a first analysis is carried out to find differences between lucrative and not lucrative companies.

The next chapter illustrates why classification trees and forests are used in this book and which software is used to generate them. The methods are explained based on a simple example which is presented, too.

The subsequent chapter contains the actual analysis and a comparison of the different results. The obtained findings are summarised in the final chapter and a conclusion is drawn.

2. Introduction to key figure analysis

Because the main part of this book is an analysis of annual statements' key fig-

ures, it is necessary to explain what key figures are in general and what ad-

vantages and shortcomings such an analysis has. This explanation can be found

in this chapter. Moreover, the assumptions and risks of the data analysis are

presented which are crucial for the entire project.

This chapter is part of the Business Understanding phase because it explains

the context of the upcoming data analysis. Some further aspects of this phase

like the target group of this analysis have already been mentioned in the intro-

duction to meet the structure of a scientific paper.

2.1 The principle of key figures

A key figure is a condensed indicator for a certain quantifiable issue. It provides information in such a way that the beholder gets a quick overview of the most important aspects of this issue, and it points out abnormalities (Pook and Tebbe 2002, 104). In this study, such a figure has to inform about the economic state of the regarded enterprise.

Every key indicator should be designed in a way that it has a clear meaning because, in theory, it is of course possible to put arbitrary figures in the numerator and denominator of a key figure (März 1983, 80). But even the value of a well-designed key figure often has no meaning of its own and only gets a meaning when it is compared to the value of another company or to a certain reference value (Johnson 1970, 1167). Furthermore, it is advisable to look at the development of certain key indicators over time (Pytlik 1994, 98).

It is important to mention that there are absolute key figures, relative key figures, and proportional key figures (Mittag 2011, 73).

Several key figures can be combined into a so-called "key figure system", which aims to represent managerial interdependencies and certain external influences (Löbbe 2001, 24). Such a system can be used to compare several enterprises even if they differ in some respects (Schult 2003, 15).

2.2 The classical key figure analysis approach

Such a key figure system can be used to analyse the economic state of a company. To do this, the enterprise is assessed based on subcategories like asset structure, profitability, and liquidity. All key figures necessary for such an assessment are calculated based on the figures which are published by the company. The results of these subcategories are then combined into an overall result (Löbbe 2001, 23, Franken 2007, 3-11). Such an analysis should also provide information about how rich or poor a firm is, why and how much its assets have changed, and how successful it will be in the future (Löbbe 2001, 35, Franken 2007, 3).

This depiction creates the impression that such an approach can be used in

this paper to predict whether a company is going to realise profit or to incur a

loss. But this is not the case since these techniques have many shortcomings,

which are summarised in the following paragraph.

These classical approaches try to infer the current state of an enterprise from certain key figures (Löbbe 2001, 34). This is very problematic because there are no proven theories which could enable such deductive reasoning but only evidence about certain interdependencies. This is why an analyst has to make a lot of assumptions about which key figures to look at, how strong the impact of each of them is, and how to combine the results of the subcategories into an overall assessment. Furthermore, it is in most cases ambiguous whether a certain value of the regarded figure is actually a "good" or a "bad" sign. This makes such an appraisal highly subjective. Besides that, using too few figures reduces the informative value of the key figure system because some important aspects are not considered, whereas using too large a system makes it very hard to get a quick overview. Because of that, identifying an appropriate number of regarded figures is not trivial either (Löbbe 2001, 33-46, Hauschildt and Baetge 2000, 115, Küting and Weber 1994, 342).

But there are many other problems, too. One of them is that the results of these classical approaches are often not precise enough (Moro and Schäfer 2004). Moreover, such judgements require a lot of time and generate relatively high costs because they hardly benefit from modern information processing (Nanni and Lumini 2009, 3028).

Because of this lack of a theoretical foundation, the insufficient precision

and the high cost, these approaches are not used in this book.

2.3 Modern key figure analysis approaches

All these disadvantages of the previously mentioned approaches motivated sci-

entists to develop new kinds of methods to make predictions based on key fig-

ures from annual statements. The ongoing evolution of digital information pro-

cessing, which enables the practical application of these methods, is also a very

important reason why the significance of these methods continues to increase

(Löbbe 2001, 35).

Because there is no proven theory about the dependencies between key fig-

ures and the state of the corresponding enterprise, this approach encompasses

various data mining methods.

Data Mining (DM) is both the science and art of intelligent data analysis, which aims at gaining insights into the data and at learning about interesting patterns and trends (Williams 2011, VII, Hastie, Tibshirani and Friedman 2009, 8, Han, Kamber and Pei 2011, 8). A pattern is usually regarded as relevant if it is universally valid, not already known by the user, and useful and understandable for him. Such relevant patterns are regarded as knowledge (Runkler 2010, 2).

The identified knowledge is often represented as models, which are a structured representation of the underlying data. Models are sometimes also called "learners". They can subsequently be used for predictions or to learn more about the data (Williams 2011, 3-4, Hastie, Tibshirani and Friedman 2009, 20-21).

DM was introduced by the database community in the 1980s and is now also advanced by statisticians and artificial intelligence scientists (Williams 2011, VII). Statistics added various computational methods and visualisation techniques to DM. Artificial intelligence contributed its focus on heuristics, and the database experts provided the knowledge of how to efficiently store and access the large amounts of data which have to be analysed (Gorunescu 2011, 2-3).

Nowadays, different kinds of data like data from social media, patient data, and data from the retail industry and science are collected (Han, Kamber and Pei 2011, 2). DM methods can be used to analyse this data and to predict heart attacks, identify cancer, anticipate share prices, and recognize spam emails (Hastie, Tibshirani and Friedman 2009, 20-21).

There are several different approaches to categorizing DM methods. One of them is presented here. The first category is "characterization and discrimination", where the properties of certain user-defined classes are analysed. In the category "mining frequent patterns", item sets and patterns which occur frequently within the data are identified. In the case of "classification and regression", the classes or a certain target value of not yet classified objects have to be determined. These methods require a certain number of already classified objects to determine the classification model. Methods of "cluster analysis" try to identify objects which belong together based on similarity considerations when no class information exists in advance. Moreover, there are also methods for the detection of outliers. These are objects which are very different from most of the other objects (Han, Kamber and Pei 2011, 15-21).

The analysis of this study is a classification task. Firms are classified as firms which will grow intensively (= class 1) or as firms which will not grow or will even shrink (= class 2). The right class is not known in advance and is determined based on concrete annual report data of the regarded enterprises from previous years (Anders and Szczesny 1999, 3). The corresponding DM methods cannot give reasons for the underlying observations but can be used for predictions if the assumption holds that the identified trends or patterns stay valid up to the prediction point (Löbbe 2001, 34, Franken 2007, 1). Besides that, the used methods make it possible to get an impression of the quality of the generated results (Hauschildt and Baetge 2000, 115). The classical approach does not offer this possibility.

Edward I. Altman was the first scientist who used such a modern approach. He applied multiple discriminant function analysis to annual report data (Löbbe 2001, 46). Because of this method's very restrictive assumptions of linear separability, multivariate normality, and independence of the predictive variables, other authors have applied other methods to this kind of data (Chandra, Ravi and Bose 2009, 4831). Examples of other used data mining methods are neural networks, decision trees, and support vector machines (Kumar and Ravi 2007, 4-13).

An important advantage of such data mining approaches is that they meet the principles of the analysis of annual statements: The results meet the objectification principle because they are generated based on empirical data. They also meet the neutralisation principle because the importance of each key figure is determined by the used method. Last but not least, they meet the holism principle since assets, finances, and yields are all taken into account (Baetge and Henning 2008, 279).

Furthermore, it is important to point out that it is possible to combine the re-

sults of the classical and the data mining approaches to benefit from all of their

advantages simultaneously.

Since the classification and prediction model is built based on given data, the quality of this data directly influences the quality of the model and has to be taken into account (Löbbe 2001, 137).

It is important to mention that DM is not just a collective term for various data analysis methods but describes an entire process which is carried out as a project. In such a project, DM experts, data experts, and domain experts have to collaborate to bring together the knowledge of how to analyse the data, how to access the data, and how to understand the data's semantics. Moreover, the actual target of the DM project and the intended procedure are often not clear at the beginning and are often specified based on first results. Even after the procedure and the targets are specified, it is often necessary to return to previous stages because of certain new insights. Furthermore, several models are created, tested, and improved in the course of the project until a satisfactory performance is achieved (Williams 2011, 5-8, Runkler 2010, 3).

The CRISP-DM reference model is the most common one and encompasses

plenty of best practices. To benefit from these best practices, this model is con-

sidered in this book. The model's description can be found in the appendix.

2.4 Limitations of annual report analysis

At the end of this chapter it is important to point out important general as-

pects of analysing annual statement data because these facts directly influence

the quality of the created model.

First of all, annual reports are not originally designed to be used as a foundation for predicting growth but rather concern the past by telling how wealthy the company is and why its assets have changed. This means that the annual report is diverted from its intended use (Franken 2007, 3).

Another problem, especially in the context of small and medium-sized companies, is that their success strongly depends on the manager of the company. Unfortunately, most used datasets do not contain any information like age, gender, and education of this person (Anders and Szczesny 1999, 1-2).

Furthermore, there is often no information about the enterprise's strategic goals, its capability to be innovative, the professionalism of the manager and his staff, and the customer focus. All these aspects influence whether a company is going to be successful but cannot be used because they are either not available at all or very hard to operationalize and, therefore, require controversial generalisations (Moro and Schäfer 2004, Fritz 1993, 1, Feldo 2011, 8).

But even the available information cannot be regarded as objective, which influences the informational value of the key figures as well. The reason for this is that companies have a certain level of discretion as far as the calculation of certain values is concerned so that two identical companies can legally create different annual statements. At least some companies take advantage of this to create their annual statement in such a way that they have to pay less tax (Löbbe 2001, 43, Rauscher and Rockel 2001). Besides that, annual statements are not instantly available at the beginning of a year so that analysts have to wait until they can use this information for prediction. If they need the outcomes of their predictions earlier, they have to rely on older data. This degrades the accuracy of their prediction (Löbbe 2001, 43).

But there are further problems, too, which have rather mathematical causes. One of them is that even if all required values are available, some key indicators cannot be calculated because their denominator has the value zero. In huge datasets, this most likely occurs a few times so that these firms have to be removed, too (Löbbe 2001, 138). Moreover, the same value of the same key indicator can be the result of completely different initial values which are divided by each other. For example, both 2/4 and 333/666 have the same result 0.5, which, on the one hand, makes it possible to compare completely different enterprises as mentioned before but, on the other hand, makes it complicated to infer certain properties of the firm from such a division result (Franken 2007, 9).
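
Both mathematical issues can be made concrete with a few lines of code. The following is only a sketch: the function `key_indicator` and the three toy firms are hypothetical and merely reproduce the 2/4 versus 333/666 example together with one undefined ratio.

```python
def key_indicator(numerator, denominator):
    """Return numerator / denominator, or None if the ratio is undefined."""
    if denominator == 0:
        return None  # a zero denominator makes the key indicator incalculable
    return numerator / denominator


# Hypothetical (numerator, denominator) pairs of three firms.
firms = {"small": (2, 4), "large": (333, 666), "undefined": (5, 0)}

ratios = {name: key_indicator(n, d) for name, (n, d) in firms.items()}

# Firms whose indicator cannot be calculated are removed.
usable = {name: value for name, value in ratios.items() if value is not None}
```

The firms "small" and "large" end up with the identical value 0.5 although their underlying figures differ by orders of magnitude, which is exactly why the ratio alone says little about the size of a firm.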

Despite all these shortcomings of annual statements, Gräfer still points out that it is meaningful to use them for prediction purposes because they are often the only publicly available source of information and still contain a lot of useful data (1988, 52).

3. The available dataset

In this chapter, the used dataset of this paper is described. In this context, both the content and the structure of the dataset are illustrated. This chapter is part of the Data Understanding phase of CRISP-DM.

3.1 Description of the dataset

The dataset originates from the company "Bureau van Dijk Electronic Publishing GmbH" (BvD). BvD obtains digitalised data about companies from its information providers, combines this data, and provides this data to its customers for analysis and research purposes. BvD also collects some of its data by itself.

The dataset contains both companies which are listed at a stock exchange and companies which are not or no longer listed with an emphasis upon non-incorporated firms (Bureau van Dijk Electronic Publishing GmbH 2013). Furthermore, this dataset, which is called "Amadeus", encompasses records from eastern and western Europe. In total, approximately three million different companies are inside Amadeus. To enable comparisons of international firms, especially the annual report data was collected in a standardised way (Bureau van Dijk Electronic Publishing GmbH 2013).

Amadeus is stored in five comma-separated-values files (csv-files). Its values are enclosed in quotation marks, and consecutive values are separated by tabulator characters. The structure of such a csv-file can be seen in Illustration 1.

Illustration 1: Structure of the available csv-file

In this analysis, only two of the five csv-files are required: master file data (86 features) and finance data (72 features). Each of these files has a file size of approximately nine gigabytes. The master file data contains the names and addresses of the regarded companies. Additionally, it is mentioned in which industry they operate, which important trademarks they possess, and where most

of their goods are produced. As can be seen in Illustration 1, there is often more than one row for the same company. This seems to be the case if the corresponding feature is a descriptive feature and, therefore, has more than one value for this company at the same time (Bol 2004, 16). For instance, this is the case if a company has changed its name several times and consequently has more than one former name. In these cases, only the first row is complete and all the other rows just contain the same "BvD ID number", company name, and the additional feature characteristics. Such a file structure avoids redundancy and reduces the file size.
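
The described file layout (quoted values, tabulator separators, and additional rows for multi-valued features) can be read with a few lines of standard-library Python. This is a minimal sketch under assumptions: the sample rows and the column names "City" and "Previous name" are invented for illustration, and only "BvD ID number" and the company name are taken from the description above.

```python
import csv
import io

# Invented sample in the described layout: quoted values, tab-separated.
SAMPLE = (
    '"BvD ID number"\t"Company name"\t"City"\t"Previous name"\n'
    '"DE001"\t"Alpha GmbH"\t"Essen"\t"Alpha KG"\n'
    '"DE001"\t"Alpha GmbH"\t""\t"Alpha OHG"\n'
    '"DE002"\t"Beta AG"\t"Bonn"\t""\n'
)


def read_master(text):
    """Collapse the rows of one company into a single record."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quotechar='"')
    companies = {}
    for row in reader:
        bvd_id = row["BvD ID number"]
        if bvd_id not in companies:
            # Only the first row of a company is complete.
            companies[bvd_id] = {
                "name": row["Company name"],
                "city": row["City"],
                "previous_names": [],
            }
        # Every row may contribute one more value of the multi-valued feature.
        if row["Previous name"]:
            companies[bvd_id]["previous_names"].append(row["Previous name"])
    return companies


companies = read_master(SAMPLE)
```

Because the reader yields one row at a time, the same loop could in principle be applied to a file object instead of an in-memory string, which matters for files of several gigabytes.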

The finance dataset contains the actual annual reports. Every row represents

exactly one report the date of which is saved in the column "Account date".

Other characteristic features are the gross profit, the number of employees and

the costs of materials.

Another very important column is the already mentioned "BvD ID number", which is unique for every company and makes it possible to merge data from several csv-files. If, for instance, the user requires the industry code for a given annual report, he just has to go through the master file data and look for the first row which has the same "BvD ID number" as the annual report.
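
The lookup via the "BvD ID number" can be sketched as follows. The sketch does not reproduce the merge procedure from the appendix. The column name "Industry code", the date format, and all values are assumptions for illustration.

```python
def index_first_rows(master_rows):
    """Map each BvD ID number to the first master file row that carries it."""
    index = {}
    for row in master_rows:
        # setdefault keeps the first (complete) row and ignores later ones
        index.setdefault(row["BvD ID number"], row)
    return index


def industry_code_for(report, index):
    """Return the industry code for an annual report, or None if unknown."""
    master_row = index.get(report["BvD ID number"])
    return None if master_row is None else master_row["Industry code"]


# Hypothetical rows; the second DE001 row is an incomplete additional row.
master_rows = [
    {"BvD ID number": "DE001", "Industry code": "4711"},
    {"BvD ID number": "DE001", "Industry code": ""},
    {"BvD ID number": "DE002", "Industry code": "6201"},
]
reports = [
    {"BvD ID number": "DE002", "Account date": "31/12/2009"},
    {"BvD ID number": "DE001", "Account date": "31/12/2009"},
    {"BvD ID number": "DE999", "Account date": "31/12/2009"},
]

index = index_first_rows(master_rows)
codes = [industry_code_for(report, index) for report in reports]
```

Building the dictionary once and then looking up each report avoids scanning the master file again for every annual report, which would be very slow for millions of entries.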

3.2 Data clean-up

As in most databases, the data from Amadeus has to be manipulated and some records have to be excluded before it can be analysed. This section presents such manipulations, which are carried out to enable data analysis. Further manipulations which are related to key figures are mentioned in chapter 4. Because the used data is distributed over two database tables, it has to be merged. The necessary steps are described in the appendix.

First of all, it has to be mentioned that only German companies are regarded because of the setting of the task, which means that all other companies are excluded. Besides that, only annual reports from the years 2007, 2008, 2009, 2010 and 2011 are regarded. There are more recent reports in the dataset, too, but far fewer than for the mentioned five years. To ensure a certain representativeness of results, older data is accepted.

Furthermore, it is ensured that only those annual reports are considered which cover exactly twelve months. There are a few reports in the database, too, which summarise a different number of months. Such reports are not comparable to those which cover exactly one year. It also does not appear sensible to multiply the used key figures by a factor which could compensate for a different number of months. The reason is that the underlying assumption that costs and earnings stay the same every month is in most cases not true because, for instance, a tourist hotel in a ski region usually earns more money and also has higher costs during winter.

Additionally, it is ensured that no consolidated reports are extracted because annual reports of corporate groups and of individual firms have completely different purposes, and it is not sensible to regard them simultaneously (Vorstius 2004, 26-27). In this book, only annual reports of firms are regarded.

Besides that, the accounting practice has to be "Local GAAP" and not "IFRS". Comparing firms which use different accounting practices can lead to wrong results because they often calculate the same key figures using different rules (Lembke 2007, 6-7). Because over 99.8 percent of all reports are based on "Local GAAP", this accounting practice is selected, and all reports which are based on "IFRS" are excluded.

Annual reports which do not contain any key figure values at all are also not part of the final dataset. For the actual analysis, companies which do not have a lucrativeness value for the prediction year are not considered either.
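
The clean-up rules of this section can be summarised as a single filter. The following is a sketch under assumptions: the field names (`country`, `year`, `months`, `consolidated`, `accounting_practice`, `key_figures`) and the helper `make_report` are hypothetical and do not correspond to the actual Amadeus column names.

```python
ANALYSED_YEARS = {2007, 2008, 2009, 2010, 2011}


def keep_report(report):
    """Apply the clean-up rules of this section to one annual report."""
    return (
        report["country"] == "DE"                          # only German companies
        and report["year"] in ANALYSED_YEARS               # only 2007 to 2011
        and report["months"] == 12                         # exactly twelve months
        and not report["consolidated"]                     # no consolidated reports
        and report["accounting_practice"] == "Local GAAP"  # no IFRS reports
        # at least one key figure value has to be present
        and any(value is not None for value in report["key_figures"].values())
    )


def make_report(**overrides):
    """Build a hypothetical report that passes all rules, then apply overrides."""
    report = {
        "country": "DE",
        "year": 2009,
        "months": 12,
        "consolidated": False,
        "accounting_practice": "Local GAAP",
        "key_figures": {"gross_profit": 120.0, "employees": None},
    }
    report.update(overrides)
    return report


reports = [
    make_report(),
    make_report(months=9),
    make_report(accounting_practice="IFRS"),
    make_report(key_figures={"gross_profit": None, "employees": None}),
]
cleaned = [report for report in reports if keep_report(report)]
```

Note that a single missing key figure does not lead to exclusion here. Only reports without any key figure value are dropped, in line with the rule stated above.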

4. Key figure selection

In the previous chapters, it is explained what a key figure analysis is and what

kind of data is available. This chapter elucidates which criteria appropriate key

figures have to meet and which key figures are used for the analysis.

Like the previous chapter, this chapter is also part of the Data Understand-

ing phase.

4.1 Significant key figure requirements

Section 2.4 illustrated that the key figure analysis of annual reports has
several disadvantages which reduce the meaningfulness of its results. To
address these problems, several scientists have proposed requirements, which
are presented in this section and taken into account in the next one.
Generally speaking, the intention of these requirements is to identify a set of
key indicators which tend to be either higher or lower for intensively growing
companies than for companies that are not growing intensively (Pytlik 1994,
234). Moreover, the key indicator itself, or all the features necessary for its
calculation, has to exist in the available dataset.

The first and probably most obvious requirement is to use relatively recent
key figures. Pytlik points out that the accuracy of the classification decreases
if the key figures used are too old (1994, 94).

Another time-related criterion is to regard only key indicators from the same
period. External effects like crises and booms often have a strong impact on
annual reports, so a key figure from such a period must not be compared with
one from a "normal" economic situation (Pytlik 1994, 230).

Furthermore, it is important not to regard similar or redundant key indicators
at the same time. Doing so would foster the identification of relationships
which do not exist "in reality" but exist in the dataset only by chance.
Moreover, it would slow down the execution of the data mining method. Examples
of redundancy are the simultaneous consideration of very similar key indicators
or of a key indicator and its reciprocal (Pytlik 1994, 234).
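One conceivable way to spot such redundancy is a rank correlation check; the source names the requirement but does not prescribe a procedure, so the following is only a sketch. The function names, the sample values and the cut-off of 0.95 are assumptions made for illustration. A rank correlation is used because a positive indicator and its reciprocal are perfectly monotonically related even though their linear correlation is not perfect.

```python
# Sketch: flag a pair of key indicators as redundant if their Spearman
# rank correlation is close to +1 or -1. Assumes samples without ties.

def ranks(values):
    """Assign ranks 1..n to the values (no tie handling, for brevity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation coefficient for tie-free samples."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


indicator = [0.10, 0.25, 0.40, 0.55, 0.80]     # made-up values of a ratio
reciprocal = [1 / x for x in indicator]        # the same ratio, inverted

rho = spearman(indicator, reciprocal)
redundant = abs(rho) > 0.95                    # hypothetical cut-off
```

Here `rho` is exactly -1, so only one of the two indicators would be kept.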

Additionally, key indicators whose numerator and denominator can each be either
positive or negative have to be excluded because their value is hard to
interpret, which worsens prediction results. For instance, a positive value can
result from dividing two positive or two negative values, and these two cases
often describe very different situations (Pytlik 1994, 234).

As explained in the appendix, key indicators whose denominator can have the
value zero have to be rejected. Alternatively, the data entries whose
denominator has the value zero have to be excluded.
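Both problems can be made concrete with a few lines of code. This is a minimal sketch with made-up numbers; the helper name `ratio_or_none` is an assumption for this example and does not come from the source.

```python
# Sketch: computing a ratio-type key indicator while excluding data
# entries with a zero (or missing) denominator.

def ratio_or_none(numerator, denominator):
    """Return numerator / denominator, or None if the entry must be dropped."""
    if numerator is None or denominator is None:
        return None                 # missing input value
    if denominator == 0:
        return None                 # zero denominator: exclude this entry
    return numerator / denominator


# Sign ambiguity: two hypothetical firms with opposite situations
# end up with the same positive indicator value.
firm_a = ratio_or_none(50.0, 500.0)      # both components positive
firm_b = ratio_or_none(-50.0, -500.0)    # both components negative

undefined = ratio_or_none(10.0, 0.0)     # entry would be excluded
```

Both `firm_a` and `firm_b` evaluate to 0.1, which is why such an indicator cannot be interpreted without looking at the signs of its components.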

It is important to point out that although it is often advisable to work with
fractions, because they are easier to compare with the values of other
enterprises, the usage of absolute values can still be reasonable, too (Küting
and Weber 1994, 24).

4.2 The selected key figures of this analysis

Now that all the key figure requirements have been presented, the selection of
the actual key indicators is justified. In general, the literature offers two
different approaches to selecting key figures. The first approach is to analyse
a huge number of different key figures even if it is not always clear in advance
why each of them should be a good predictor for the dependent variable.

The other approach is to select a relatively small number of key figures
(Löbbe 2001, 158). Each of them
• should either be chosen because its meaning gives reason to assume that it
  predicts the correct class,
• or should already have proven its effectiveness in earlier studies,
• or should be considered to be important by scientists,
• or should have a high significance for practical applications (Pytlik 1994,
  233).

In this study, the second approach is chosen. Although the first approach might
identify certain relationships which could be missed by selecting key figures
manually, it has the disadvantage that it might also present spurious links
which do not exist "in reality" but only exist in the dataset by chance.
Moreover, this method tempts the scientist to embellish the results in
retrospect.

This section first presents the class variable, which identifies whether a
company is lucrative or not. It makes sense to start with this class variable
because it is also used later on as a predictor, so that it does not have to be
explained several times. After that, the other key indicators are presented.

4.2.1 Selected class variable

The class variable should indicate that a company has grown more strongly than
average in the regarded year. To identify an appropriate class variable,
suggestions from the literature are taken into account. Furthermore, it is made
sure that only about five percent of all companies whose annual reports contain
all the data needed to calculate the class variable are regarded as intensively
growing. Although this limit is arbitrary, it should ensure that, for instance,
politicians and investors will invest only in highly lucrative companies.

A firm has to meet all of the following criteria in the regarded year to be
considered exceptionally growing (→ lucrative=TRUE):

Return on Equity (ROE) in the current year 0%

Absolute increase of ROE compared to previous year 5%

Turnover in relation to previous year 130%

If all necessary values exist in the dataset but at least one of these criteria
is not met, the company is regarded as not lucrative (→ lucrative=FALSE). If
data is missing so that the three key figures cannot be calculated, the
variable is set to not available (NA). To exclude outliers, the following
scheme is applied: if a firm meets all the requirements of a lucrative firm and
at least one of the following conditions, it is regarded as an outlier and its
class variable is also set to NA:

Return on Equity (ROE) in the current year > 200%

Absolute increase of ROE compared to previous year > 100%

Turnover in relation to previous year > 300%

These values are also arbitrary to a certain degree but have been identified by
looking at the distributions of these key figures. It is at least plausible
that a firm is highly unlikely (though still able) to more than triple its
turnover within one year.
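The labelling scheme can be sketched as a small function. This is an illustration under stated assumptions, not the implementation used in the study: the function and parameter names are hypothetical, ROE values are expected in percent, a previous-year turnover of zero is treated as NA, and the three lucrativeness thresholds (0%, 5%, 130%) are assumed to be inclusive because the comparison signs of these criteria are not legible in this text. The outlier thresholds use the strict ">" given above.

```python
# Sketch of the class variable "lucrative" with three outcomes:
# True, False, or None (standing in for NA).

def classify(roe, roe_prev, turnover, turnover_prev):
    """Label one firm-year; ROE inputs are percentages."""
    if None in (roe, roe_prev, turnover, turnover_prev):
        return None                              # missing data -> NA
    if turnover_prev == 0:
        return None                              # assumption: ratio undefined -> NA

    roe_incr = roe - roe_prev                    # change in percentage points
    turnover_ratio = 100.0 * turnover / turnover_prev

    # Assumption: thresholds are inclusive (>=).
    meets_criteria = roe >= 0 and roe_incr >= 5 and turnover_ratio >= 130
    if not meets_criteria:
        return False

    # Outlier rule: a lucrative firm exceeding any of these limits becomes NA.
    if roe > 200 or roe_incr > 100 or turnover_ratio > 300:
        return None
    return True


growing = classify(12.0, 4.0, 1400.0, 1000.0)    # all criteria met
stagnant = classify(12.0, 10.0, 1400.0, 1000.0)  # ROE rose by only 2 points
outlier = classify(250.0, 180.0, 1400.0, 1000.0) # ROE above 200%
unknown = classify(None, 4.0, 1400.0, 1000.0)    # missing ROE
```

Note that the outlier rule only applies to firms that would otherwise be labelled lucrative; a firm that fails one of the three criteria stays FALSE regardless of its other values.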

The first key figure considered is return on equity (ROE).

ROE shows the yield of shareholders' capital. A low value means that the
company does not generate enough profit or that it has assets which generate
costs but are no longer required. Another explanation is a very high level of
stocks (inventories) and, therefore, a high capital commitment. A high value
increases the firm's attractiveness for new investors and indicates an adequate
combination of assets, which is important for future profits (Deimel, Heupel
and Wiltinger 2013, 195-196). Because of this strong orientation towards the
future, this key figure is chosen as one determinant of the class variable.

For a similar reason, the second variable, ROE_Incr, is chosen. It is also
based on ROE but regards the development of this value compared to the
previous year.

A positive ROE_Incr can mean that the company's profit has increased, that the
company was able to dispose of unnecessary assets and thereby reduce fixed
costs, or both. This leads to an increase of the firm's attractiveness and
sustainability.

The last determinant considered is the turnover ratio, that is, the turnover in
relation to that of the previous year.

The turnover is the sum of all sales of own products (Dumke 1996, 244).
Many authors recommend using the turnover ratio to identify growing companies.
One reason is that a value of this key figure above 100 percent means that
there is an increasing demand for the company's products, so the company will
most likely expand in order to be able to accommodate demand. In the
literature, values between 120 and 130 percent are proposed to identify growing
companies (Harms 2004, 13; Moog 2004, 2). In this study, the upper limit is
chosen, in keeping with this study's commitment to identify very lucrative
companies.
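The defining formulas of the three determinants are not reproduced in this text. Under the assumption that ROE follows the usual textbook definition (profit of the year divided by equity) and that the other two figures are defined as the verbal descriptions above suggest, they can be written as follows, with t denoting the regarded year. The exact profit and equity items used in the study may differ.

```latex
\begin{align*}
\mathrm{ROE}_t &= \frac{\text{profit}_t}{\text{equity}_t} \cdot 100\% \\[4pt]
\mathrm{ROE\_Incr}_t &= \mathrm{ROE}_t - \mathrm{ROE}_{t-1} \\[4pt]
\text{turnover ratio}_t &= \frac{\text{turnover}_t}{\text{turnover}_{t-1}} \cdot 100\%
\end{align*}
```

As a worked example with made-up numbers: a firm whose ROE rises from 4% to 12% has an ROE_Incr of 8 percentage points, and a turnover growing from 1,000 to 1,400 yields a turnover ratio of 140%.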

To get a picture of the class variable's distribution, some important parame-

ters of this variable are provided in Table 1.

Key figure     % of NAs    Mode           Adj. gini ratio
lucrative11    90.25%      FALSE (94%)    0.23
lucrative10    90.70%      FALSE (92%)    0.28

Table 1: Important parameters of the class variable

Because the lucrative-variables of the years 2008 and 2009 are used as predic-

tors, their parameters are provided in the next section.

The first column of Table 1 contains the name of the key indicator. The

second column tells how many German companies do not have a value for this

Details

Type of Edition: First edition
Publication Year: 2014
ISBN (Softcover): 9783954893041
ISBN (eBook): 9783954898046
File size: 2.3 MB
Language: English
Publication date: 2014 (August)
Keywords: prediction data mining
Product Safety: Anchor Academic Publishing
