
Prediction of highly lucrative companies using annual statements: A Data Mining based approach

©2014 Textbook 98 Pages

Summary

The intention of this study is to predict one year in advance whether a given firm will grow extraordinarily in the next year. This is crucial for private investors and fund managers who need to decide whether they should invest in a certain firm. Companies like Apple and Amazon have shown that people who recognized the potential of such companies at the right time earned a lot of money.
The applied prediction models can also be used by politicians to identify companies which are eligible for funding, because growing companies often hire many employees.
Since annual reports are often publicly available for free, it is reasonable to take advantage of them for such a prediction. The prediction models are based on classification trees and forests because they have some very substantial advantages over other methods like neural networks, which are frequently used in the literature. For instance, they do not rely on distributional assumptions, accept both quantitative and qualitative inputs, and are not sensitive to outliers. Furthermore, they are easy for humans to understand and can deal with missing values, which is crucial for practical applications.

Excerpt

Table Of Contents


List of tables
Table 1: Important parameters of the class variable ... 27
Table 2: Selected qualitative key figures ... 29
Table 3: Selected absolute key figures ... 31
Table 4: Selected relative key figures ... 36
Table 5: Four entries of the example dataset ... 43
Table 6: Cross validated classification tree results for 2010 ... 62
Table 7: rpart's prediction of lucrativeness for 2011 ... 65
Table 8: Results for the reduced dataset ... 67
Table 9: Cross validated classification forest results for 2010 ... 68
Table 10: Random forest's prediction of lucrativeness for 2011 ... 69
Table 11: Results of the measures to improve the precision of the forests ... 70

List of abbreviations
BvD ... Bureau van Dijk Electronic Publishing GmbH
CART ... Classification and Regression Trees
csv-file ... Comma-separated-values-file
CV ... Cross validation
DM ... Data Mining
FN ... False negative
FP ... False positive
IQR ... Interquartile range
NA ... Not available
RF ... Random forest
RM ... Reference model
ROE ... Return on equity
SQL ... Structured Query Language
TN ... True negative
TP ... True positive

Acknowledgment
I would like to thank my supervisor Prof. Dr. Andreas Behr for assisting
me with this book. He provided me with valuable suggestions and gave me the
opportunity to write about my favourite topic. Thank you very much!
I would also like to say thank you to my parents and my girlfriend Sarah for
their support during my entire work.

1. Introduction and problem description
In the literature, many scientists describe how to use annual report data to
predict whether a certain company is going to become bankrupt (Dimitras,
Zanakis and Zopounidis 1996, 487-513). The reason why this topic attracts such
a high degree of scientific attention is rather obvious: The stability of the
financial system depends on the ability of banks and other financial service
providers to assess whether a certain firm will be able to repay a loan or not.
Furthermore, banks need this information to be able to calculate an adequate
probability of default to identify a minimum interest rate for a concrete loan
(Moro and Schäfer 2004).
Nevertheless, it is not only relevant to anticipate this worst case of
bankruptcy, but also whether a given small firm will grow extraordinarily in
the next year and maybe even become a big company in the medium term. This is
crucial information for private investors and fund managers who need to decide
whether they should invest in a certain firm. Companies like Apple and Amazon
have shown in the past that people who recognized the potential of such
companies and bought their shares have earned a lot of money.
The prediction models, which are described in this paper, can also be used
by politicians to identify companies which are eligible for funding. Because
growing companies oftentimes hire many employees, it might be meaningful to
facilitate their development process by selective subsidies to reduce unem-
ployment. Furthermore, it is possible to question the prediction results of a fi-
nancial analyst if he came to a different conclusion than a model.
Since annual reports are often publicly available for free, it is reasonable
to take advantage of them for such a prediction (Gräfer 1988, 52).
Additionally, various information providers maintain huge databases with
annual reports. A big data approach promises to further improve the accuracy
of predictions (Rauscher and Rockel 2001, 5). This paper introduces methods
which make it possible to extract knowledge from these huge data sources in
order to identify extraordinarily lucrative firms.
To generate these prediction models, a data mining approach is used which
is based on the established CRISP-DM process model for data mining processes.
CRISP-DM ensures comparability and the consideration of best practices
(Chapman, et al. 2000, 1-2). The prediction models are based on classification
trees and forests because they have some very substantial advantages over
other methods like neural networks, which are frequently used in the
literature. For instance, the underlying algorithms of the used model do not
require a certain distributional assumption, accept both quantitative and
qualitative inputs, and are not sensitive to outliers. The two most important
advantages, however, are interpretability and the handling of missing data. A
tree can be easily interpreted by users, which is important for the previously
described stakeholders because it is not easy to trust the results of a model
which one does not understand (Löbbe 2001, 199). A lack of understanding might
therefore impede the practical implementation of such a model. Besides that,
the used algorithms can handle missing data, which occur very often in the
available dataset. In other analyses, such data entries would have been
removed even if only one value was missing. This reduces the often already
relatively small amount of available data and can reduce the model's accuracy
(Neeb 2011, 67, Franken 2007, 5). This is not the case for the applied
methods.
1.1 Intention of this study
The intention of this paper is to determine whether a stakeholder can use a
classification tree or classification forest at the beginning of one year to identi-
fy German firms which will grow exceptionally in this year using annual re-
ports' key figures from previous years. As a first step, key figures from the
years 2007, 2008 and 2009 are used to generate different trees and forests
which can predict whether a company grows outstandingly in 2010 or not.
These models require the lucrativeness information from 2010 to be generated.
To evaluate how well these unchanged models would work for the mentioned
stakeholder at the beginning of the year 2011, they are also applied to data
from 2008, 2009 and 2010 as a second step. This means that this time, the
models are applied to more recent data to anticipate whether the regarded firms
will grow intensively in 2011. Data from 2011 is only used to check the predic-
tions' correctness and not to generate models. The best identified models are
also compared and analysed.
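The two-step setup described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the procedure implemented in the study: the function name `build_window`, the nested dictionary layout, the key figure name `roe`, and the toy values are all hypothetical.

```python
# Hypothetical layout: key_figures[firm][year] -> {key figure name: value},
# labels[firm][year] -> True if the firm counted as lucrative in that year.


def build_window(key_figures, labels, target_year, n_years=3):
    """Pair the key figures of the n_years before target_year with its label."""
    rows = []
    for firm, by_year in key_figures.items():
        label = labels.get(firm, {}).get(target_year)
        if label is None:
            continue  # no class label for the target year: firm is skipped
        features = {}
        for lag in range(1, n_years + 1):
            # A missing report year simply contributes no features here.
            report = by_year.get(target_year - lag, {})
            for name, value in report.items():
                features[f"{name}_t-{lag}"] = value
        rows.append((firm, features, label))
    return rows


key_figures = {
    "A": {2007: {"roe": 0.10}, 2008: {"roe": 0.12},
          2009: {"roe": 0.15}, 2010: {"roe": 0.20}},
    "B": {2008: {"roe": 0.05}, 2009: {"roe": 0.04}, 2010: {"roe": 0.03}},
}
labels = {"A": {2010: True, 2011: True}, "B": {2010: False, 2011: False}}

# Step 1: key figures from 2007-2009 with the 2010 label (model generation).
train = build_window(key_figures, labels, target_year=2010)
# Step 2: key figures from 2008-2010; the 2011 label only checks predictions.
check = build_window(key_figures, labels, target_year=2011)
```

Shifting the window by one year while leaving the model untouched mirrors the situation of a stakeholder who has to decide at the beginning of 2011 without knowing that year's outcome.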
These four particular years have been chosen because the available dataset
"Amadeus" only contains a relatively small amount of more recent data. It is
probably not necessary to consider more than three years for the generation of
these models because it has been shown in the literature that additional years
do not noticeably improve prediction (Pytlik 1994, 94).

One important characteristic of this paper is the usage of the CRISP-DM
model, which is frequently used by data analysts and helps them to understand
the analysis more easily. Furthermore, this model encompasses important best
practices which could otherwise be overlooked.
Another distinctive feature of this analysis is that the used dataset has a
total size of about 18 gigabytes. The reason why it is so huge is that it
contains more than 24 million entries, each of which contains up to 158
attributes, of over 3 million different European companies. Such dimensions
are untypical for such analyses¹ and overburden frequently used software like
R. Furthermore, this analysis overstrains most current desktop computers
because they do not have enough main memory for it. Besides that, this dataset
does not have a typical database structure. This is why this analysis meets
two important criteria (size and complicated structure) of Big Data analysis
(IBM Corporation Software Group).
¹ Kumar and Ravi have shown in their review that the majority of bankruptcy
prediction studies do not analyse more than 9000 firms simultaneously (2007,
6-7).
1.2 Proceeding
To reach all these goals, the following procedure is chosen. Since this Data
Mining analysis is based on key figures from annual reports, chapter 2
describes some main principles of such an analysis, why it is more powerful
than other analyses using older techniques, and which drawbacks the usage of
annual statements has in general. This chapter provides some explanations of
why the generated models sometimes fail to make a correct prediction.
Chapter 3 presents the used dataset and the requirements its entries have
to meet in order to be analysed in this book. Moreover, the appendix explains
how to solve some of its structural problems.
After the available data is described, chapter 4 presents and elucidates all
the qualitative and quantitative key figures which are used to predict growth.
To be meaningful, these key figures have to meet certain demands, which are
presented as well. Furthermore, it depicts how it is determined whether a
company is lucrative and has, therefore, grown outstandingly or not, because
there are several different possibilities to do that. Based on these key
figures, a first analysis is carried out to find differences between lucrative
and non-lucrative companies.
The next chapter illustrates why classification trees and forests are used in
this book and which software is used to generate them. The methods are
explained based on a simple example, which is presented, too.
The subsequent chapter contains the actual analysis and a comparison of the
different results. The obtained findings are summarised in the final chapter
and a conclusion is drawn.

2. Introduction to key figure analysis
Because the main part of this book is an analysis of annual statements' key fig-
ures, it is necessary to explain what key figures are in general and what ad-
vantages and shortcomings such an analysis has. This explanation can be found
in this chapter. Moreover, the assumptions and risks of the data analysis are
presented which are crucial for the entire project.
This chapter is part of the Business Understanding phase because it explains
the context of the upcoming data analysis. Some further aspects of this phase
like the target group of this analysis have already been mentioned in the intro-
duction to meet the structure of a scientific paper.
2.1 The principle of key figures
A key figure is a condensed indicator for a certain quantifiable issue. It
provides information in such a way that the beholder gets a quick overview of
the most important aspects of this issue, and it points out abnormalities
(Pook and Tebbe 2002, 104). In this study, such a figure has to inform about
the economic state of the given enterprise.
Every key indicator should be designed in a way that it has a clear meaning
because, in theory, it is of course possible to put arbitrary figures in the
numerator and denominator of a key figure (März 1983, 80). But even the value
of a well-designed key figure often has no meaning of its own and only becomes
meaningful when it is compared to the value of another company or to a certain
reference value (Johnson 1970, 1167). Furthermore, it is advisable to look at
the development of certain key indicators over time (Pytlik 1994, 98).
It is important to mention that there are absolute key figures, relative
key figures, and proportional key figures (Mittag 2011, 73).
Several key figures can be combined into a so-called "key figure system"
which aims to represent managerial interdependencies and certain external
influences (Löbbe 2001, 24). Such a system can be used to compare several
enterprises even if they are different in some respects (Schult 2003, 15).
2.2 The classical key figure analysis approach
Such a key figure system can be used to analyse the economic state of a
company. To do this, the enterprise is assessed based on subcategories like
asset structure, profitability, and liquidity. All necessary key figures for
such an assessment are calculated based on the figures which are published by
the company. The results of these subcategories are then combined into an
overall result (Löbbe 2001, 23, Franken 2007, 3-11). Such an analysis should
also provide information about how rich or poor a firm is, why and how much
its assets have changed, and how successful it will be in the future (Löbbe
2001, 35, Franken 2007, 3).
This depiction creates the impression that such an approach can be used in
this paper to predict whether a company is going to realise profit or to incur a
loss. But this is not the case since these techniques have many shortcomings,
which are summarised in the following paragraph.
These classical approaches try to infer the current state of an enterprise
from certain key figures (Löbbe 2001, 34). This is very problematic because
there are no proven theories which could enable such deductive reasoning but
only evidence about certain interdependencies. This is why an analyst has to
make a lot of assumptions about which key figures to look at, how strong the
impact of each of them is, and how to combine the results of the subcategories
into an overall assessment. Furthermore, it is in most cases ambiguous whether
a certain value of the given figure is actually a "good" or a "bad" sign. This
makes such an appraisal highly subjective. Besides that, using too few figures
reduces the informative value of the key figure system as some important
aspects are not considered. Using too big a system makes it very hard to get a
quick overview. Because of that, identifying an appropriate number of figures
is also not trivial (Löbbe 2001, 33-46, Hauschildt and Baetge 2000, 115,
Küting and Weber 1994, 342).
But there are many other problems, too. One of them is that the results of
these classical approaches are often not precise enough (Moro and Schäfer
2004). Moreover, such judgements require a lot of time and generate relatively
high costs because they hardly benefit from modern information processing
(Nanni and Lumini 2009, 3028).
Because of this lack of a theoretical foundation, the insufficient precision
and the high cost, these approaches are not used in this book.
2.3 Modern key figure analysis approaches
All these disadvantages of the previously mentioned approaches motivated
scientists to develop new kinds of methods to make predictions based on key
figures from annual statements. The ongoing evolution of digital information
processing, which enables the practical application of these methods, is also
a very important reason why the significance of these methods continues to
increase (Löbbe 2001, 35).
Because there is no proven theory about the dependencies between key fig-
ures and the state of the corresponding enterprise, this approach encompasses
various data mining methods.
Data Mining (DM) is both the science and art of intelligent data analysis,
which aims to gain insights into the data and to learn about interesting
patterns and trends (Williams 2011, VII, Hastie, Tibshirani and Friedman
2009, 8, Han, Kamber and Pei 2011, 8). A pattern is usually regarded as
relevant if it is universally valid, not already known by the user, and useful
and understandable for the user. Such relevant patterns are regarded as
knowledge (Runkler 2010, 2).
The identified knowledge is often represented as models, which are a
structured representation of the underlying data. Models are sometimes also
called "learners". They can, further on, be used for predictions or to learn
more about the data (Williams 2011, 3-4, Hastie, Tibshirani and Friedman 2009,
20-21).
DM was introduced by the database community in the 1980s and is now al-
so advanced by statisticians and artificial intelligence scientists (Williams
2011, VII). Statistics added various computational methods and visualisation
techniques to DM. Artificial intelligence contributed its focus on heuristics,
and the database experts provided the knowledge how to efficiently store and
access large amounts of data which have to be analysed (Gorunescu 2011, 2-3).
Nowadays, different kinds of data like data from social media, patient data
and data from the retail industry and science are collected (Han, Kamber and
Pei 2011, 2). DM methods can be used to analyse this data and to predict heart
attacks, identify cancer, anticipate share prices, and recognize spam emails
(Hastie, Tibshirani and Friedman 2009, 20-21).
There are several different approaches to categorizing DM methods. One of
them is presented here. The first category is "characterization and
discrimination", where the properties of certain user-defined classes are
analysed. In the category "mining frequent patterns", item sets and patterns
which occur frequently within the data are identified. In the case of
"classification and regression", the classes or a certain target value of not
yet classified objects have to be determined. These methods require a certain
amount of already classified objects to determine the classification model.
Methods of "cluster analysis" try to identify objects which belong together
based on similarity considerations when no class information exists in
advance. Moreover, there are also methods for the detection of outliers. These
are objects which are very different from most of the other objects (Han,
Kamber and Pei 2011, 15-21).
The analysis of this study is a classification task. Firms are classified as
firms which will grow intensively (=class 1) or which will not grow or will
even shrink (=class 2). The right class is not known in advance and is
determined based on concrete annual report data of the given enterprises from
previous years (Anders and Szczesny 1999, 3). The corresponding DM methods
cannot give reasons for the underlying observations but can be used for
predictions if the assumption holds that the identified trends or patterns
stay valid up to the prediction point (Löbbe 2001, 34, Franken 2007, 1).
Besides that, the used methods make it possible to get an impression of the
quality of the generated results (Hauschildt and Baetge 2000, 115). The
classical approach does not offer this possibility.
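One common way to get such an impression of quality for a two-class task is to count true and false positives and negatives (TP, FP, TN, FN in the list of abbreviations) and derive ratios from them. The following Python sketch is an illustration only; the function names and the toy labels are assumptions, and the text does not state that the study computes its quality measures exactly like this.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN; True stands for class 1 (intensive growth)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fn = sum(1 for a, p in pairs if a and not p)
    return tp, fp, tn, fn


def precision(tp, fp):
    """Share of firms predicted as class 1 that really belong to class 1."""
    return tp / (tp + fp) if (tp + fp) else None


def recall(tp, fn):
    """Share of actual class 1 firms that the model found."""
    return tp / (tp + fn) if (tp + fn) else None


# Toy example with six firms (invented values).
actual = [True, False, True, False, False, True]
predicted = [True, True, False, False, False, True]
tp, fp, tn, fn = confusion_counts(actual, predicted)
```

For an investor, precision is often the more relevant figure because it describes how many of the firms flagged as fast-growing actually grew.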
Edward I. Altman was the first scientist who used such a modern approach.
He applied multiple discriminant function analysis to annual report data
(Löbbe 2001, 46). Because of this method's very restrictive assumptions on
linear separability, multivariate normality and independence of the predictive
variables, other authors have applied other methods to this kind of data
(Chandra, Ravi and Bose 2009, 4831). Examples of other used data mining
methods are neural networks, decision trees, and support vector machines
(Kumar and Ravi 2007, 4-13).
An important advantage of such a data mining approach is that it meets
the principles of the analysis of annual statements: The results meet the
objectification principle because they are generated based on empirical data.
They also meet the neutralisation principle because the importance of each key
figure is determined by the used method. Last but not least, they meet the
holism principle since assets, finances, and yields are all taken into account
(Baetge and Henning 2008, 279).
Furthermore, it is important to point out that it is possible to combine the re-
sults of the classical and the data mining approaches to benefit from all of their
advantages simultaneously.

Since the classification and prediction model is now built based on given
data, the quality of this data directly influences the quality of this model
and has to be taken into account (Löbbe 2001, 137).
It is important to mention that DM is not just a collective term for various
data analysis methods but describes an entire process which is carried out as a
project. In such a project, DM experts, data experts, and domain experts have
to collaborate to bring together the knowledge of how to analyse the data, how
to access the data, and how to understand the data's semantics. Moreover, the
actual target of the DM project and the intended procedure are often not clear
at the beginning and are often specified based on first results. Even after
the procedure and the targets are specified, it is often necessary to return
to previous stages because of certain new insights. Furthermore, several
models are created, tested and improved in the course of the project until a
satisfactory performance is achieved (Williams 2011, 5-8, Runkler 2010, 3).
The CRISP-DM reference model is the most common one and encompasses
plenty of best practices. To benefit from these best practices, this model is con-
sidered in this book. The model's description can be found in the appendix.
2.4 Limitations of annual report analysis
At the end of this chapter, it is necessary to point out some general
aspects of analysing annual statement data because these facts directly
influence the quality of the created model.
First of all, annual reports are not originally designed to be used as a
foundation for predicting growth but rather concern the past by telling how
wealthy the company is and why its assets have changed. This means that the
annual report is diverted from its intended use (Franken 2007, 3).
Another problem, especially in the context of small and medium-sized
companies, is that their success strongly depends on the manager of the
company. Unfortunately, most used datasets do not contain any information like
age, gender and education of this person (Anders and Szczesny 1999, 1-2).
Furthermore, there is often no information about the enterprise's strategic
goals, its capability to be innovative, the professionalism of the manager and
his staff, and the customer focus. All these aspects influence whether a
company is going to be successful but cannot be used because they are either
not available at all or very hard to operationalize and, therefore, require
controversial generalisations (Moro and Schäfer 2004, Fritz 1993, 1, Feldo
2011, 8).
But even the available information cannot be regarded as objective, which
influences the informational value of the key figures as well. The reason for
this is that the companies have a certain level of discretion as far as the
calculation of certain values is concerned so that two identical companies
can legally create different annual statements. At least some companies take
advantage of this to create their annual statement in such a way that they
have to pay less tax (Löbbe 2001, 43, Rauscher and Rockel 2001). Besides that,
annual statements are not instantly available at the beginning of a year so
that analysts have to wait until they can use this information for prediction.
If they need the outcomes of their predictions earlier, they have to rely on
older data. This degrades the accuracy of their prediction (Löbbe 2001, 43).
But there are further problems, too, which are of a rather mathematical
nature. One of them is that even if all required values are available, some
key indicators cannot be calculated because their denominator has the value
zero. In huge datasets, this most likely occurs a few times so that these
firms have to be removed, too (Löbbe 2001, 138). Moreover, the same value of
the same key indicator can result from completely different initial values
which are divided by each other. For example, both 2/4 and 333/666 have the
same result 0.5, which, on the one hand, makes it possible to compare
completely different enterprises as mentioned before but, on the other hand,
makes it complicated to infer certain properties of the firm from such a
division result (Franken 2007, 9).
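Both mathematical effects can be made concrete with a few lines of Python. The helper `key_ratio` is a hypothetical name used for illustration; returning `None` for a zero denominator is one possible convention, not necessarily the one used in the study.

```python
from typing import Optional


def key_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None if the denominator is zero."""
    if denominator == 0:
        return None  # not computable; the caller decides how to treat the entry
    return numerator / denominator


# Two very different firms end up with the same key figure value.
small_firm = key_ratio(2, 4)
large_firm = key_ratio(333, 666)

# A zero denominator makes the key figure undefined.
undefined = key_ratio(5, 0)
```

The identical values for `small_firm` and `large_firm` show why a ratio alone does not reveal the size of the underlying amounts.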
Despite all these shortcomings of annual statements, Gräfer still points out
that it is meaningful to use them for prediction purposes because they are
often the only publicly available source of information and still contain a
lot of useful data (1988, 52).

3. The available dataset
In this chapter, the used dataset of this paper is described. In this context,
both the content and the structure of the dataset are illustrated. This
chapter is part of the Data Understanding phase of CRISP-DM.
3.1 Description of the dataset
The dataset originates from the company "Bureau van Dijk Electronic Publishing
GmbH" (BvD). BvD obtains digitalised data about companies from its information
providers, combines this data, and provides this data to its customers for
analysis and research purposes. BvD also collects some of its data by itself.
The dataset contains both companies which are listed at a stock exchange
and companies which are not or no longer listed with an emphasis upon
non-incorporated firms (Bureau van Dijk Electronic Publishing GmbH 2013).
Furthermore, this dataset, which is called "Amadeus", encompasses records from
eastern and western Europe. In total, approximately three million different
companies are inside Amadeus. To enable comparisons of international firms,
especially the annual report data was collected in a standardised way (Bureau
van Dijk Electronic Publishing GmbH 2013).
Amadeus is stored in five Comma-separated-values files (csv-files). Its
values are enclosed in quotation marks, and consecutive values are separated
by tabulator characters. The structure of such a csv-file can be seen in
Illustration 1.
Illustration 1: Structure of the available csv-file
In this analysis, only two of the five csv-files are required: master file data
(86 features) and finance data (72 features). Each of these files has a file
size of approximately nine gigabytes. The master file data contains the names
and addresses of the regarded companies. Additionally, it is mentioned in
which industry they operate, which important trademarks they possess, and
where most
of their goods are produced. As can be seen in Illustration 1, there is often
more than one row for the same company. This seems to be the case if
the corresponding feature is a descriptive feature and, therefore, has more than
one value for this company at the same time (Bol 2004, 16). For instance, this
is the case if a company has changed its name several times and consequently
has more than one former name. In these cases, only the first row is complete
and all the other rows just contain the same "BvD ID number", company name,
and the additional feature characteristics. Such a file structure makes it
possible to avoid redundancy and to reduce the file size.
The finance dataset contains the actual annual reports. Every row represents
exactly one report, the date of which is saved in the column "Account date".
Other characteristic features are the gross profit, the number of employees and
the costs of materials.
Another very important column is the already mentioned "BvD ID number",
which is unique for every company and makes it possible to merge data from
several csv-files. If, for instance, the user requires the industry code for a
given annual report, he just has to go through the master file data and look
for the first row which has the same "BvD ID number" as the annual report.
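The lookup described above can be sketched with Python's standard `csv` module, which can read tab-separated values enclosed in quotation marks. This is a toy illustration under stated assumptions: only the column names "BvD ID number" and "Account date" come from the text, while the other column names, the helper names, and all values are invented, and the in-memory strings stand in for the real files.

```python
import csv
import io

# Toy stand-ins for the master file data and the finance data (invented rows).
MASTER = (
    '"BvD ID number"\t"Company name"\t"Industry code"\t"Former name"\n'
    '"DE001"\t"Alpha GmbH"\t"2511"\t"Alpha KG"\n'
    '"DE001"\t"Alpha GmbH"\t""\t"Alpha OHG"\n'
    '"DE002"\t"Beta AG"\t"4711"\t""\n'
)
FINANCE = (
    '"BvD ID number"\t"Account date"\t"Gross profit"\n'
    '"DE001"\t"2009-12-31"\t"120"\n'
    '"DE002"\t"2009-12-31"\t"80"\n'
)


def read_rows(text):
    """Parse tab-separated text whose values are enclosed in quotation marks."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quotechar='"')
    return list(reader)


def first_row_per_company(master_rows):
    """Keep only the first (complete) master row for each BvD ID number."""
    index = {}
    for row in master_rows:
        index.setdefault(row["BvD ID number"], row)
    return index


def attach_industry(finance_rows, master_index):
    """Add the industry code of the matching master row to each report."""
    merged = []
    for report in finance_rows:
        master = master_index.get(report["BvD ID number"], {})
        merged.append({**report, "Industry code": master.get("Industry code")})
    return merged


reports = attach_industry(read_rows(FINANCE),
                          first_row_per_company(read_rows(MASTER)))
```

Building the index once avoids scanning the master file for every report. For files of several gigabytes, the rows would have to be streamed rather than loaded into a list, which this sketch does not attempt.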
3.2 Data clean-up
As with most databases, the data from Amadeus has to be manipulated and some
records have to be excluded before it can be analysed. This section presents
such manipulations, which are carried out to enable data analysis. Further
manipulations which are related to key figures are mentioned in chapter 2.
Because the used data is distributed over two database tables, it has to be
merged. The necessary steps are described in the appendix.
First of all, only German companies are considered because of the setting of
the task, which means that all other companies are excluded. Besides that,
only annual reports from the years 2007, 2008, 2009, 2010 and 2011 are
considered. There are more recent reports in the dataset, too, but far fewer
than for the mentioned five years. To ensure a certain representativeness of
the results, older data is accepted.
Furthermore, it is ensured that only those annual reports are considered
which cover exactly twelve months. There are a few reports in the database,
too, which summarise a different number of months. Such reports are not
comparable to those which cover exactly one year. It also does not appear
sensible to multiply the used key figures by a factor which could compensate
for a different number of months. The reason is that the underlying assumption
that costs and earnings stay the same every month is in most cases not true
because, for instance, a tourist hotel in a ski region usually earns more
money and also has higher costs during winter.
Additionally, it is ensured that no consolidated statements are extracted
because annual reports of corporate groups and of individual firms have
completely different purposes and it is not sensible to consider them
simultaneously (Vorstius 2004, 26-27). In this book, only annual reports of
individual firms are considered.
Besides that, the accounting practice has to be "Local GAAP" and not "IFRS".
It can lead to wrong results if firms using different accounting practices are
compared because they often calculate the same key figures using different
rules (Lembke 2007, 6-7). Because over 99.8 percent of all reports are based on
"Local GAAP", this accounting practice is selected. All the reports which are
based on "IFRS" are excluded, too.
Annual reports which do not contain any key figure values at all are also
not part of the final dataset. For the actual analysis, companies which do not
have a lucrative-value for the prediction year are not considered either.
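The clean-up rules of this section can be summarised as a single filter. The following Python sketch only illustrates the logic: the field names (`country`, `months`, `consolidated`, and so on) are hypothetical and do not correspond to actual Amadeus column names, and the rule about a missing lucrative-value for the prediction year is left out for brevity.

```python
from datetime import date

ACCEPTED_YEARS = {2007, 2008, 2009, 2010, 2011}


def keep_report(report):
    """Return True if a report passes the clean-up rules sketched here."""
    return (
        report["country"] == "DE"                           # German firms only
        and report["account_date"].year in ACCEPTED_YEARS
        and report["months"] == 12                          # exactly one year
        and not report["consolidated"]                      # individual firms
        and report["accounting_practice"] == "Local GAAP"
        and any(v is not None for v in report["key_figures"].values())
    )


# Invented example reports.
reports = [
    {"id": "r1", "country": "DE", "account_date": date(2009, 12, 31),
     "months": 12, "consolidated": False,
     "accounting_practice": "Local GAAP", "key_figures": {"roe": 0.1}},
    {"id": "r2", "country": "DE", "account_date": date(2009, 6, 30),
     "months": 6, "consolidated": False,
     "accounting_practice": "Local GAAP", "key_figures": {"roe": 0.2}},
    {"id": "r3", "country": "DE", "account_date": date(2010, 12, 31),
     "months": 12, "consolidated": False,
     "accounting_practice": "IFRS", "key_figures": {"roe": 0.3}},
    {"id": "r4", "country": "DE", "account_date": date(2010, 12, 31),
     "months": 12, "consolidated": False,
     "accounting_practice": "Local GAAP", "key_figures": {"roe": None}},
]
kept_ids = [r["id"] for r in reports if keep_report(r)]
```

Expressing every rule as one condition keeps the exclusion criteria easy to audit and to change.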

4. Key figure selection
In the previous chapters, it is explained what a key figure analysis is and what
kind of data is available. This chapter elucidates which criteria appropriate key
figures have to meet and which key figures are used for the analysis.
Like the previous chapter, this chapter is also part of the Data Understand-
ing phase.
4.1 Significant key figure requirements
In section 2.4 it is illustrated that the key figure analysis of annual reports
has several disadvantages, which reduce the meaningfulness of its results. To
address these problems, several scientists have proposed a few requirements,
which are presented in this section and are taken into account in the next
section. Generally speaking, the intention of these requirements is to identify
a set of key indicators which tend to be either higher or lower for
intensively growing companies than for not intensively growing companies
(Pytlik 1994, 234). Moreover, the key indicator itself or all the features
which are necessary for its calculation have to exist in the available dataset.
The first and probably most obvious requirement is to use relatively recent
key figures. Pytlik points out that the accuracy of the classification decreases if
the used key figures are too old (1994, 94).
Another time-related criterion is to regard only key indicators from the
same period. The reason is that external effects like crises and booms often
have a strong impact on annual reports, so a key figure from such a period
cannot be meaningfully compared with one from a "normal" economic situation
(Pytlik 1994, 230).
Furthermore, it is important not to regard similar or redundant key
indicators at the same time. Redundancy fosters the identification of
relationships which do not exist "in reality" but exist in the dataset only
by chance. Moreover, it slows down the execution of the data mining method.
Examples of redundancy are the simultaneous consideration of very similar
key indicators or of a key indicator and its reciprocal (Pytlik 1994, 234).
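The redundancy requirement can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the function names, the use of the Pearson correlation as a similarity measure, and the threshold of 0.95 are assumptions made only for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equally long value lists (0.0 if one is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def is_redundant(xs, ys, threshold=0.95):
    """Flag two key indicators as redundant (hypothetical helper, assumed threshold)."""
    # Reciprocal pair: the product of both indicators is 1 for every firm.
    reciprocal = all(math.isclose(x * y, 1.0) for x, y in zip(xs, ys))
    # Very similar pair: absolute correlation at or above the assumed threshold.
    return reciprocal or abs(pearson(xs, ys)) >= threshold
```

With such a check, only one indicator of each flagged pair would be kept as a candidate predictor.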
Additionally, key indicators whose numerator and denominator can both be
positive as well as negative have to be excluded because their value is hard
to interpret, which worsens prediction results. For instance, a positive
value can result from dividing two positive or two negative values, and these
two cases often mean very different things (Pytlik 1994, 234).
As explained in the appendix, key indicators whose denominator can have the
value zero have to be rejected. Alternatively, the data entries whose
denominator has the value zero have to be excluded.
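The two requirements on numerators and denominators can be sketched in Python as follows. This is an illustration under assumptions, not the implementation of this study: the function names are hypothetical, and each key indicator is assumed to be given as a list of (numerator, denominator) pairs, one pair per firm.

```python
def has_ambiguous_sign(pairs):
    """True if both numerator and denominator take positive and negative values."""
    def mixed(values):
        return any(v > 0 for v in values) and any(v < 0 for v in values)
    numerators = [n for n, _ in pairs]
    denominators = [d for _, d in pairs]
    return mixed(numerators) and mixed(denominators)

def ratios_without_zero_denominator(pairs):
    """Compute the ratio per firm, dropping entries whose denominator is zero."""
    return [n / d for n, d in pairs if d != 0]
```

An indicator for which `has_ambiguous_sign` returns True would be excluded entirely, whereas `ratios_without_zero_denominator` corresponds to the alternative of excluding only the affected data entries.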
It is important to point out that although it is often advisable to work with
fractions because they are easier to compare with the values of other
enterprises, the usage of absolute values can still be reasonable (Küting and
Weber 1994, 24).
4.2 The selected key figures of this analysis
Now that the key figure requirements have been presented, the selection of
the actual key indicators is justified. In general, the literature offers two
different approaches to selecting key figures. The first approach is to
analyse a huge number of different key figures even if it is not always clear
in advance why every single one of them should be a good predictor for the
dependent variable.
The other approach is to select a relatively small number of key figures
(Löbbe 2001, 158). Each of them
- should either be chosen because its meaning gives reason to assume that it
  predicts the correct class,
- or should already have proven its effectiveness in earlier studies,
- or should be considered to be important by scientists,
- or should have a high significance for practical applications (Pytlik 1994,
  233).
In this study, the second approach is chosen. Although the first approach
might identify certain relationships which could be missed by selecting key
figures manually, it has the disadvantage that it might also present supposed
links which do not exist "in reality" but only exist in the dataset by
chance. Moreover, this method tempts the scientist to whitewash the results
in retrospect.
To structure this section, the chosen class variable, which identifies
whether a company is lucrative or not, is presented first. It is sensible to
start with this class variable because it is also used later on as a
predictor, so it does not have to be explained several times. After that, the
other key indicators are presented.

4.2.1 Selected class variable
The class variable should indicate that a company has grown more strongly
than average in the regarded year. To identify an appropriate class variable,
suggestions from the literature are taken into account. Furthermore, it is
made sure that only about five percent of all companies whose annual reports
contain all the data needed to calculate the class variable are regarded as
intensively growing. Although this limit is arbitrary, it should ensure that,
for instance, politicians and investors only invest in highly lucrative
companies.
The following criteria have to be met by a firm at the same time in the
regarded year so that it is considered to be exceptionally growing
(→ lucrative=TRUE):
1. Return on Equity (ROE) in the current year ≥ 0%
2. Absolute increase of ROE compared to previous year ≥ 5%
3. Turnover in relation to previous year ≥ 130%
If all necessary values exist in the dataset but at least one of these
criteria is not met, the company is regarded as not lucrative
(→ lucrative=FALSE). If data is missing so that the three key figures cannot
be calculated, the variable is set to not available (NA). To exclude
outliers, the following scheme is applied: if a firm meets all the
requirements of a lucrative firm and at least one of the following
conditions, it is regarded as an outlier and its class variable is also set
to NA:
1. Return on Equity (ROE) in the current year > 200%
2. Absolute increase of ROE compared to previous year > 100%
3. Turnover in relation to previous year > 300%
These values are also arbitrary to a certain degree but have been identified
by looking at the distributions of these key figures. It is at least
plausible that it is highly unlikely (but still possible) for a firm to more
than triple its turnover within one year.
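The labelling scheme described above can be summarised in a short Python sketch. It rests on several assumptions and is not the code used in this study: the function name and parameters are hypothetical, ROE values are assumed to be given in percent, the lower bounds are read as "at least" (the comparison signs of the first list are not legible in this excerpt), and NA is represented by None.

```python
from typing import Optional

def classify_lucrative(roe: Optional[float], roe_prev: Optional[float],
                       turnover: Optional[float],
                       turnover_prev: Optional[float]) -> Optional[bool]:
    """Return True/False for lucrative, or None (NA) if undefined or an outlier."""
    inputs = (roe, roe_prev, turnover, turnover_prev)
    # Missing data, or a turnover ratio that cannot be computed: NA.
    if any(v is None for v in inputs) or turnover_prev == 0:
        return None
    roe_incr = roe - roe_prev                        # increase in percentage points
    turnover_ratio = turnover / turnover_prev * 100  # in percent of previous year
    # Assumed reading of the three criteria: all lower bounds are inclusive.
    lucrative = roe >= 0 and roe_incr >= 5 and turnover_ratio >= 130
    if not lucrative:
        return False
    # Outlier rule: a lucrative firm exceeding any upper limit is set to NA.
    if roe > 200 or roe_incr > 100 or turnover_ratio > 300:
        return None
    return True
```

For example, a firm with an ROE of 10% (previous year 2%) and a turnover that rose from 100 to 150 would be labelled lucrative, whereas the same firm with a turnover of 400 would be treated as an outlier.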
The first considered key figure is Return on Equity (ROE).
ROE shows the yield of shareholders' capital. A low value means that the
company does not generate enough profit or that it has assets which generate
costs but are no longer required. Another explanation is a very high level of
stocks (inventories) and, therefore, a large amount of tied-up capital. A
high value increases the firm's attractiveness for new investors and
indicates an adequate combination of assets, which is important for future
profits (Deimel, Heupel and Wiltinger 2013, 195-196). Because of this strong
orientation towards the future, this key figure is chosen as one determinant
of the class variable.
For a similar reason, the second variable is chosen. It is also based on ROE
but regards the development of this value compared to the previous year:
\[ \text{ROE\_Incr} = \text{ROE}_{\text{current year}} - \text{ROE}_{\text{previous year}} \]
A positive ROE_Incr can mean that the company's profit has increased, that
the company was able to dispose of unnecessary assets and thereby reduce
fixed costs, or both. This increases the firm's attractiveness and
sustainability.
The last considered determinant is the turnover ratio:
\[ \text{turnover ratio} = \frac{\text{turnover}_{\text{current year}}}{\text{turnover}_{\text{previous year}}} \cdot 100\% \]
The turnover is the sum of all sales of own products (Dumke 1996, 244).
Many authors recommend using the turnover ratio to determine growing
companies. One reason is that a value of this key figure above 100 percent
means that there is an increasing demand for the company's products, so the
company will most likely expand in order to be able to accommodate demand. In
the literature, values between 120 and 130 percent are proposed to identify
growing companies (Harms 2004, 13; Moog 2004, 2). In this study, the upper
limit is chosen, in line with the aim of identifying very lucrative
companies.
To get a picture of the class variable's distribution, some important
parameters of this variable are provided in Table 1.

Key figure    % of NAs   Mode          Adj. gini ratio
lucrative11   90.25%     FALSE (94%)   0.23
lucrative10   90.70%     FALSE (92%)   0.28

Table 1: Important parameters of the class variable
Because the lucrative-variables of the years 2008 and 2009 are used as predic-
tors, their parameters are provided in the next section.
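As a rough illustration of how the "% of NAs" and "Mode" columns of Table 1 could be derived, consider the following Python sketch. The function name is hypothetical, NA is represented by None, and it is an assumption that the percentage given next to the mode refers to the non-missing values only. The adjusted gini ratio is not covered because its definition is not part of this excerpt.

```python
from collections import Counter

def na_share_and_mode(values):
    """Return (share of NAs, mode, share of the mode among non-missing values)."""
    total = len(values)
    present = [v for v in values if v is not None]
    na_share = (total - len(present)) / total
    if not present:
        return na_share, None, None
    # most_common(1) yields the most frequent value together with its count.
    mode, count = Counter(present).most_common(1)[0]
    return na_share, mode, count / len(present)
```

Applied to a class variable with six missing entries, three FALSE values and one TRUE value, the function reports 60% NAs and FALSE as the mode with a share of 75%.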
The first column of Table 1 contains the name of the key indicator. The
second column tells how many German companies do not have a value for this

Details

Pages: 98
Type of Edition: First edition
Year: 2014
ISBN (eBook): 9783954898046
ISBN (Softcover): 9783954893041
File size: 2.3 MB
Language: English
Publication date: 2014 (August)
Keywords: prediction, data mining