00001
00002 Semes - Concept Formation
00003 -------------------------
00004 Linas Vepstas May 2009
00005 Revised September 2009
00006
00007
00008 This directory contains some notes and experimental code for forming
00009 and extracting conceptual entities or "semes" from English text. An
00010 example of a "conceptual entity" would be the "Great Southern Railroad":
00011 a business, a railway, that existed at a certain point in space and time.
00012 Seme formation overlaps and combines two related tasks: "named entity
00013 extraction" and "reference resolution". The first, "named entity
00014 extraction", attempts to identify named objects. "Reference resolution"
00015 attempts to determine when two different words (usually in different
00016 sentences) refer to the same thing.
00017
00018 The primary challenges of seme extraction are:
00019 1) Constructing a data representation that is amenable to reasoning,
00020 and to question answering.
00021 2) Recognizing when two different semes refer to the same concept
00022 (so that they can be merged).
00023 3) Recognizing when one seme (incorrectly) refers to distinct concepts,
00024 and should be split apart.
00025 4) Learning new conceptual classifications, ontologies and relations;
00026 so, for example, when encountering a new, unknown, word, to determine
00027 that it is, for example, a previously unknown color name.
00028
00029 The text below is split into three parts: a motivational overview of
00030 "semes", a discussion of various issues that arise in concept formation,
00031 and, finally, details of the data structures and algorithms currently
00032 implemented in this directory.
00033
00034
00035 Part I -- Motivation -- Why semes?
00036 ==================================
00037 The notion of a seme is introduced here to solve a "simple" technical
00038 problem in knowledge representation. The goal of using "semes" is to
00039 move away from using words and/or word-instances to represent "things".
00040 Thus, "Mary's shoes" and "Tom's shoes" would be two different semes,
00041 because they are two different "things" (although both things are
00042 related to the word "shoe").
00043
00044 By contrast, "Mary had some shoes. The shoes were red." has two
00045 distinct word-instances: the word-instance "shoe" in the first sentence,
00046 and the word instance "shoe" in the second sentence. However, these
00047 two words stand for the same concept: "Mary's red shoes". The goal of
00048 using semes is to collapse these two word-instances into one seme,
00049 without getting tangled with the fact that these two different
00050 word-instances of "shoe" are involved in different syntactic relations
00051 in different sentences (... sentences that were possibly uttered by
00052 different individuals at different times, even!)
00053
00054 Thus, "semes", as defined here, are meant to be an abstraction that
00055 behaves much like "concepts", yet, in a certain way, also behaves much
00056 like "words". This leaves the notion of "concepts" free for other,
00057 more abstract usage. Semes are meant to be fairly closely related to
00058 "words": they are only one small step towards the general goal of
00059 "conceptualization". In particular, semes are meant to be sufficiently
00060 word-like that they can be used in most relations that words are used
00061 in. So, for example, if there's a RelEx relation that connects two
00062 words, then one could have exactly the same structure connecting two
00063 semes.
00064
00065 Thus, "semes" are removed by only a small step from linguistic usage;
00066 they provide a needed abstraction on the road to true "concepts" and
00067 are just flexible enough to support basic tasks, such as (basic) entity
00068 identification, reference resolution, and (basic) reasoning.
00069
00070
00071 What is a Concept?
00072 ------------------
00073 So far, almost all the processing described in nlp/triples/README
00074 has been in terms of graph modifications performed on individual
00075 sentences, containing WordInstanceNodes, and links to WordNodes.
00076 In order to promote text input into concepts, and to reason with
00077 concepts, we need to define what a concept is, and where its boundaries
00078 extend to. At this point, the goal is not to define a super-abstract
00079 notion of a concept that passes all epistemological tests, but rather
00080 a practical, if flawed, data structure that is adequate for representing
00081 data learned by reading, learned through linguistic corpus analysis.
00082 The emphasis here is "flawed but practical": it should be just enough
00083 to take us to the next level of abstraction.
00084
00085 Naively, a concept would seem to have the following parts:
00086 -- a SemeNode, to serve as an anchor (could be ConceptNode)
00087 -- a linguistic expression complex.
00088 -- a WordNet sense tag (optional)
00089 -- a DBPedia URI tag (optional)
00090 -- an OpenCyc tag ... etc. you get the idea.
00091 -- basic ontological links -- is-a, has-a, part-of, etc.
00092 -- prepositional relations (next-to, inside-of, etc)
00093 -- Context tag(s). See section "Context" below.
00094
00095 What is a "linguistic expression complex"? It deals with the idea that
00096 most concepts are not expressible as single words: for example, "Mary
00097 had a red balloon". The head concept here is "balloon": it is an instance
00098 of the class of all balloons, and specifically, this instance is red.
00099 Thus, a "linguistic expression complex" would consist of:
00100
00101 -- a head WordNode, to give a single, leading name.
00102 Possibly several WordNodes to give it multiple names?
00103 -- dependent modifier tags (e.g. "red")
00104 -- a part-of-speech tag, to provide a rough linguistic categorization
00105 -- a collection of disjunct tags, representing possible linguistic
00106 use of the WordNode to represent this concept.
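
The two lists above can be sketched as a pair of plain Python records.
This is only an illustration of the shape of the data, not the OpenCog
representation: the class names ExpressionComplex and Seme, and all of
the field names, are made up for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class ExpressionComplex:
    """Hypothetical container for a 'linguistic expression complex'."""
    head_words: List[str]                                # one or more head WordNode names
    modifiers: Set[str] = field(default_factory=set)     # dependent modifiers, e.g. {"red"}
    pos: Optional[str] = None                            # rough part-of-speech tag
    disjuncts: Set[str] = field(default_factory=set)     # parser disjunct tags


@dataclass
class Seme:
    """Hypothetical stand-in for a SemeNode plus the things attached to it."""
    name: str                                            # anchor name, e.g. "balloon@1234"
    expression: ExpressionComplex
    wordnet_sense: Optional[str] = None                  # optional WordNet sense tag
    relations: Set[Tuple[str, str]] = field(default_factory=set)  # (relation, target)
    contexts: Set[str] = field(default_factory=set)      # context tags


# "Mary had a red balloon": the head word is "balloon", the modifier is "red".
balloon = Seme(
    name="balloon@1234",
    expression=ExpressionComplex(head_words=["balloon"], modifiers={"red"}, pos="noun"),
    contexts={"example"},
)
balloon.relations.add(("is-a", "balloon"))
```

The optional tags default to None or to empty sets, which matches the
idea that a freshly created seme carries little more than its head word.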
00107
00108
00109 Promoting Words to Concepts
00110 ---------------------------
00111 Consider the task of promoting word-relations to concepts. Consider the
00112 following relationships:
00113
00114 is_a(bark, sound)
00115 part_of(bark, tree)
00116
00117 We know that these two relations refer to different senses of the word
00118 "bark". Yet, if these two are deduced by reading, how should the system
00119 recognize that two different concepts are at play? How should the
00120 self-consistency of a set of relations be assessed? Assuming that the
00121 input text is not intentionally lying, then, under what circumstances
00122 do a set of conflicting assertions require that the underlying word be
00123 recognized as embodying two different concepts?
00124
00125 One possible approach is to assign tentative WordNet-based word-senses
00126 using either the Mihalcea algorithm, or table-lookup from syntax-tagged
00127 senses (see the wsd-post/README for details). One nice aspect of WordNet
00128 tagging is that the built-in WordNet ontology can be used to double-check
00129 and strengthen certain sense assignments. Thus, for example:
00130
00131 bark%1:20:00:: has part-holonym tree_trunk%1:20:00::
00132
00133 while
00134
00135 bark%1:11:02:: has direct hypernym noise%1:11:00::
00136 and inherited hypernym sound%1:11:00::
00137
00138 Thus, triples that have been read in, and tagged with WordNet senses,
00139 can be verified against the WordNet ontology for the correctness of
00140 sense assignments. While this is a reasonable starting point, and
00141 gives an easy leg-up, it does not solve the more general problem of
00142 distinguishing and refining concepts.
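
A minimal sketch of such a check, in Python. The two tables below are a
hand-copied excerpt of the WordNet entries quoted above; a real system
would query WordNet itself. The function name supported_by_ontology and
the table names are assumptions made for this example.

```python
# Toy excerpt of the WordNet ontology, copied from the entries quoted above.
PART_HOLONYMS = {
    "bark%1:20:00::": {"tree_trunk%1:20:00::"},
}
HYPERNYMS = {
    # direct and inherited hypernyms, flattened into one set
    "bark%1:11:02::": {"noise%1:11:00::", "sound%1:11:00::"},
}


def supported_by_ontology(relation, sense, target_sense):
    """Return True if the toy ontology confirms relation(sense, target_sense)."""
    table = {"part_of": PART_HOLONYMS, "is_a": HYPERNYMS}.get(relation, {})
    return target_sense in table.get(sense, set())


# The "sound" sense of bark is confirmed as a kind of sound ...
print(supported_by_ontology("is_a", "bark%1:11:02::", "sound%1:11:00::"))
# ... but that same sense is not part of a tree trunk.
print(supported_by_ontology("part_of", "bark%1:11:02::", "tree_trunk%1:20:00::"))
```

A triple whose tentative sense tags fail this lookup is a hint that the
sense assignment (or the triple) deserves a second look.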
00143
00144 Another approach is to use part-of-speech tags, and disjunct tags, as
00145 stand-ins for word senses. That is, the parser has already identified
00146 different word-instances according to their part-of-speech, and so at
00147 least a rough word-sense classification is available from that. That
00148 is, it is safe to assume that a noun and a verb never represent the
00149 same concept (at a certain level...). It has also been seen (see
00150 wsd-post/README) that the disjunct used during parsing has a high
00151 correlation with the word-sense; the disjunct used during parsing
00152 can be considered to be a very fine-grained part-of-speech tag. Thus,
00153 instead of using WordNet sense tags as concept "nucleation centers",
00154 the disjuncts could be used as such.
00155
00156 Two distinct processes are at play: 1) recognizing that two different
00157 word instances refer to the same concept, and 2) recognizing that a
00158 previously learned concept should be refined into two distinct concepts.
00159 (For example, having learned the properties of a "pencil", one must
00160 recognize at some point that a "mechanical pencil" and a "wooden pencil"
00161 have many incompatible properties, and thus the notion of a pencil must
00162 be split into these two new concepts).
00163
00164 The most direct route to either of these processes is by means of
00165 "consistency checking": using forward and backward chaining to determine
00166 whether two distinct statements are compatible with each other. When
00167 they are, then the two different word-instances can be assumed to refer
00168 to the same concept; relationships can then be merged.
00169
00170 Part II -- Issues
00171 =================
00172 A short discussion of issues that arise in concept/seme formation.
00173
00174 Context
00175 -------
00176 Almost all facts are contextual. You can't just say "John has a red
00177 ball" and promote that to a fact. You must presume a context of some
00178 sort: "Someone said during an IRC chat that John has a red ball", or,
00179 "While reading Emily Bronte, I learned that John had a red ball." The
00180 context is needed for two reasons:
00181
00182 1) When obtaining additional info within the same context, it is
00183 simpler/safer to deduce references, e.g. that the John in the second
00184 sentence is the same John as in the first sentence.
00185
00186 2) When obtaining additional info within a different context, it is
00187 simpler/safer to assume that references are distinct: that, for example,
00188 "John" in an Emily Dickinson novel is not the same "John" in an Emily
00189 Bronte novel.
00190
00191 Thus, it makes sense to tag recently formed SemeNodes with a context tag.
00192
00193
00194 A priori vs. Deduced Knowledge
00195 ------------------------------
00196 Consider the following:
00197
00198 capital_of(Germany, Berlin)
00199
00200 This triple references a lot of a-priori knowledge. We know that
00201 capitals are cities; thus there is a strong temptation to write a
00202 processing rule such as "IF ($var0,capital) THEN ($var0,city)".
00203 Similarly, one has a-priori knowledge that things which have capitals
00204 are political states, and so one is tempted to write a rule asserting
00205 this: "IF (capital_of($var0, $var1)) THEN political_state($var0)".
00206
00207 A current working assumption of what follows is that the various rules
00208 will/should encode a minimum of a-priori "real-world" knowledge.
00209 Instead, the goal here is to create a system that can learn, deduce
00210 such "real-world" knowledge.
00211
00212
00213 Definite vs. Indefinite
00214 -----------------------
00215 There is a subtle semantic difference between triples that describe
00216 definite properties, vs. triples that describe generic properties,
00217 or semantic classes. Thus, for example, "color_of(sky,blue)" seems
00218 unambiguous: this is because we know that the sky can only ever have
00219 one color (well, unless you are looking at a sunset). Consider
00220 "form_of(clothing, skirt)": this asserts that a skirt is a form_of
00221 clothing, and not that clothing is always a skirt. The form_of
00222 indicates a semantic category. Similarly, "group_of(people, family)"
00223 asserts that a family is a group_of people, and not that groups of
00224 people are families.
00225
00226 The distinction here seems to be whether or not the modifier was
00227 definite or indefinite: "THE color of ...." vs. "A form of.." or
00228 "A group of..."
00229
00230 XXX This is a real bug/hang-up in the triples processing code:
00231 being unaware of this distinction seems to cause some triples
00232 to come out "backwards" (i.e. that clothing is always a skirt).
00233 Caution to be used during seme formation! XXX
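
One naive way of carrying the distinction along is to tag each triple
with the determiner that introduced it. The Python sketch below is a
guess at such a heuristic, not what the triples code does; the function
names and the tag strings are invented here.

```python
def classify_of_phrase(determiner):
    """Guess how to read 'DET X of Y', based only on the determiner.

    'the color of the sky' -> a definite property
    'a form of clothing'   -> membership in a semantic category
    """
    det = determiner.lower()
    if det == "the":
        return "definite-property"
    if det in ("a", "an"):
        return "category-member"
    return "unknown"


def tag_triple(determiner, relation, first, second):
    """Attach the reading to the triple so later stages can tell them apart."""
    return (relation, first, second, classify_of_phrase(determiner))


print(tag_triple("the", "color_of", "sky", "blue"))
print(tag_triple("a", "form_of", "clothing", "skirt"))
```

With the tag in place, seme formation could refuse to treat a
"category-member" triple as a statement about every member of the class.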
00234
00235
00236 Learning Semantic Categories
00237 ----------------------------
00238 Consider the category of "types of motion". Currently, the RelEx frame
00239 rules include an explicit list of category members:
00240
00241 $Self_motion
00242 amble
00243 bustle
00244 canter
00245 clamber
00246 climb
00247 clomp
00248 coast
00249 crawl
00250 creep
00251
00252 This list clearly encodes a-priori knowledge about locomotion. It would
00253 be better if the members of this category could be deduced by reading.
00254 There are three ways in which this might be done. One might someday
00255 read a sentence that asserts "Crawling is a type of locomotion". This
00256 seems unlikely, as this is common-sense knowledge, and common-sense
00257 knowledge is not normally encoded in text. A second possibility is to
00258 learn the meaning of the word "crawl" the way that children learn it:
00259 to have someone point at a centipede and say "gee, look at that thing
00260 crawl!" Such experiential, cross-sensory learning would indeed be an
00261 excellent way to gain new knowledge. However, there are two snags:
00262 1) It presumes the existence of a teacher who already knows how to use
00263 the word "crawl", and 2) It is outside of the scope of what one person
00264 (i.e. me) can achieve in a limited amount of time. A third possibility
00265 is statistical learning: to observe a large number of statements
00266 containing the word "crawl", and, based on these, deduce that it is a
00267 type of locomotion.
00268
00269 In the following, the third approach is presumed. This is because the
00270 author has in hand both the statistical and the linguistic tools that
00271 would allow such observation and deduction to be made.
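
To make the statistical idea concrete, here is a toy Python sketch. It
assumes that each verb has been observed together with some
prepositions, and scores an unknown verb by the cosine similarity of its
preposition counts to those of a verb already known to describe motion.
The counts are invented, and the choice of prepositions as features and
of cosine similarity as the measure are assumptions of this example, not
something the current code does.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented counts: how often each verb was seen with each preposition.
observed = {
    "crawl": Counter({"across": 4, "under": 3, "into": 2}),
    "climb": Counter({"up": 5, "into": 2, "across": 1}),
    "think": Counter({"about": 6, "of": 3}),
}
known_motion = ["climb"]


def motion_score(verb):
    """Best similarity between `verb` and any verb already known to be motion."""
    return max(cosine(observed[verb], observed[k]) for k in known_motion)


# "crawl" shares prepositions with "climb"; "think" shares none.
print(motion_score("crawl") > motion_score("think"))
```

A real run would need far more data, and a threshold (or a proper
clustering step) to decide when a score is high enough to count.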
00272
00273
00274 Consistency Checking
00275 --------------------
00276 Consider the following three sentences:
00277 Aristotle is a man.
00278 Men are mortal.
00279 Aristotle is mortal.
00280
00281 Or:
00282
00283 Berlin is the capital of Germany.
00284 Capitals are cities.
00285 Berlin is a city.
00286
00287 Assume the first two sentences were previously determined to be true,
00288 with a high confidence value. How can we determine that the third
00289 sentence is plausible, i.e. consistent with the first two sentences?
00290
00291 Upon reading the third sentence, it could be turned into a hypothetical
00292 statement, and suggested as the target of the PLN backward chainer. If
00293 the chainer is able to deduce that it is true, then the confidence of
00294 all three statements can increase: they form a set of mutually
00295 self-supporting statements.
00296
00297 So, for example, the above generate:
00298 capital_of(Germany,Berlin)
00299 isa(city, capital)
00300 isa(city, Berlin)
00301
00302 The prepositional construction XXX_of(A,B) allows the deduction that
00303 isa(XXX,B) (a deduction which can be made directly from the raw sentence
00304 input, and does not need to be processed from the prepositional form).
00305 (Right??) Certainly this is true for kind_of and capital_of; is this
00306 true for all prepositional uses of "of"?
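
The Berlin example can be checked with a few lines of Python. Note the
swap: the text proposes handing the hypothesis to the PLN backward
chainer, while this sketch uses a naive forward chainer over plain
tuples, just to show that the third statement follows from the first
two. It reads the prepositional deduction as "XXX_of(A,B) gives
isa(XXX,B)", so that capital_of(Germany,Berlin) yields
isa(capital,Berlin); that reading, and the function name forward_chain,
are assumptions of this example.

```python
def forward_chain(facts):
    """Apply two toy rules until no new facts appear.

    Facts are tuples (relation, first, second). isa(klass, member) follows
    the argument order of the triples listed above.
    """
    facts = set(facts)
    while True:
        new = set()
        for rel, a, b in facts:
            # Rule 1 (assumed reading): XXX_of(A, B) => isa(XXX, B)
            if rel.endswith("_of"):
                new.add(("isa", rel[: -len("_of")], b))
            # Rule 2 (transitivity): isa(a, b) and isa(b, d) => isa(a, d)
            if rel == "isa":
                for rel2, c, d in facts:
                    if rel2 == "isa" and c == b:
                        new.add(("isa", a, d))
        if new <= facts:          # nothing new was derived: fixed point reached
            return facts
        facts |= new


known = {("capital_of", "Germany", "Berlin"), ("isa", "city", "capital")}
closure = forward_chain(known)
print(("isa", "city", "Berlin") in closure)
```

If the hypothesis is found in the closure, the three statements support
each other, and their confidences could be raised accordingly.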
00307
00308 Normally, a country can have only one capital; thus we need an exclusion
00309 rule:
00310
00311 if capital_of(X,Y) and different(Y,Z) then not capital_of(X,Z)
00312
00313 There are potentially lots of such unique relations, so the above should
00314 be formulated as
00315
00316 if R(X,Y) and uniq_grnd_relation(R) and different(Y,Z) then not R(X,Z)
00317
00318 Thus, we have a class of uniquely-grounded relations, of which capital_of
00319 is one. Part of the learning process is to somehow discover rules of the
00320 above form.
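
Once a relation is known to be uniquely grounded, the exclusion rule is
easy to apply. The Python sketch below simply assumes that set is given
(discovering it is the open problem noted above); the names
UNIQUELY_GROUNDED and conflicts are made up for this example.

```python
# Assumed to be known in advance; learning this set is the hard part.
UNIQUELY_GROUNDED = {"capital_of"}


def conflicts(new_fact, known_facts):
    """Return the known facts that a uniquely-grounded relation rules out.

    Implements: if R(X,Y) and uniq_grnd_relation(R) and different(Y,Z)
    then not R(X,Z).
    """
    rel, x, y = new_fact
    if rel not in UNIQUELY_GROUNDED:
        return []
    return [(r, a, b) for (r, a, b) in known_facts if r == rel and a == x and b != y]


known = {("capital_of", "Germany", "Berlin"), ("next_to", "Germany", "France")}
print(conflicts(("capital_of", "Germany", "Bonn"), known))
print(conflicts(("next_to", "Germany", "Poland"), known))
```

A non-empty result does not say which statement is wrong, only that the
two cannot both hold in the same context.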
00321
00322
00323 Using triples for input
00324 -----------------------
00325 Other problems: consider the sentence:
00326 "A hospital is a place where you go when you are sick."
00327
00328 One may deduce that "A hospital is a place", but one must be careful
00329 in making use of such knowledge....
00330
00331
00332 Pseudo-clustering
00333 -----------------
00334 A key step in concept formation is determining if/when two distinct
00335 instances are really the same concept. This is to be accomplished by
00336 comparing two concepts, and returning a (simple) truth value indicating
00337 the likelihood that they are the same. Many algorithms are possible. The
00338 simplest might be the following:
00339
00340 Take a weighted average of link-comparisons, comparing:
00341 -- WordNode. A mismatch here means that it is highly unlikely that the
00342 concepts are identical, unless the WordNode is a pronoun.
00343 -- ContextNode. A mismatch here means it's highly unlikely that the
00344 concepts are identical, unless the ContextNode is one of the base
00345 "common-sense" contexts.
00346 -- Compare modifiers. A modifier present in one, but absent in the
00347 other, is "neutral". Conflicting modifiers suggest a conceptual
00348 mis-match: If the current sentence calls a ball "green", while
00349 a previous one called it "red", then the two references are probably
00350 to two different balls. Ditto for big, small, light, heavy, etc.
00351 -- Compare relations, e.g. capital_of, next_to, etc. Much like comparing
00352 modifiers.
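
The weighted average above might look as follows in Python. Everything
concrete here is invented for illustration: the weights, the pronoun
list, the groups of mutually exclusive modifiers, and the dict layout
used for a concept. Relations are left out; they would be compared much
like modifiers.

```python
PRONOUNS = {"it", "he", "she", "they", "this", "that"}
COMMON_SENSE_CONTEXTS = {"common-sense"}
# Groups of mutually exclusive modifiers (illustrative only).
EXCLUSIVE = [{"red", "green", "blue"}, {"big", "small"}, {"light", "heavy"}]
WEIGHTS = {"word": 0.4, "context": 0.3, "modifiers": 0.3}   # arbitrary; they sum to 1


def word_score(a, b):
    """1.0 if the words agree or either one is a pronoun, else 0.0."""
    return 1.0 if a == b or a in PRONOUNS or b in PRONOUNS else 0.0


def context_score(a, b):
    """1.0 if the contexts agree or either is a common-sense context, else 0.0."""
    return 1.0 if a == b or a in COMMON_SENSE_CONTEXTS or b in COMMON_SENSE_CONTEXTS else 0.0


def modifier_score(mods_a, mods_b):
    """0.0 on a conflict, 1.0 on a shared modifier, 0.5 (neutral) otherwise."""
    for group in EXCLUSIVE:
        in_a, in_b = mods_a & group, mods_b & group
        if in_a and in_b and in_a != in_b:
            return 0.0
    return 1.0 if mods_a & mods_b else 0.5


def same_concept_likelihood(a, b):
    """Weighted average of the individual comparisons."""
    scores = {
        "word": word_score(a["word"], b["word"]),
        "context": context_score(a["context"], b["context"]),
        "modifiers": modifier_score(a["modifiers"], b["modifiers"]),
    }
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)


red_ball = {"word": "ball", "context": "irc", "modifiers": {"red"}}
green_ball = {"word": "ball", "context": "irc", "modifiers": {"green"}}
plain_ball = {"word": "ball", "context": "irc", "modifiers": set()}
print(same_concept_likelihood(red_ball, plain_ball) > same_concept_likelihood(red_ball, green_ball))
```

A plain average cannot express "highly unlikely" as a veto: a WordNode
mismatch only lowers the score. If a veto is wanted, a product of scores
or an explicit early return would be closer to the description above.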
00353
00354
00355
00356 Part III -- Implementation
00357 ==========================
00358
00359 Concrete Data Representation
00360 ----------------------------
00361 Let's now look at how to represent some of these ideas concretely, in
00362 terms of OpenCog hypergraphs.
00363
00364 First, a SemeNode will be used as the main anchor point. A SemeNode is
00365 used, instead of a ConceptNode, so as to leave ConceptNode open for
00366 other uses; the goal here is to minimize confusion/cross-talk between
00367 this and other parts of OpenCog.
00368
00369 Initially, when first creating a SemeNode, it should probably be given
00370 a name that is a copy of the WordInstanceNode that inspired it: "John
00371 threw a red ball" leads to
00372
00373 SemeNode "ball@634a32ebc"
00374
00375 A basic name is needed for the concept, and so, in complete analogy
00376 between WordInstanceNodes and WordNodes, we create:
00377
00378 LemmaLink
00379 SemeNode "ball@634a32ebc"
00380 WordNode "ball"
00381
00382 The LemmaLink is used to indicate the root form of the word, stripped
00383 of inflection, number, tense, etc. The idea that it's red is indicated
00384 by using modifiers, the *same* modifiers as RelEx uses, with essentially
00385 the same meanings:
00386
00387 EvaluationLink
00388 DefinedLinguisticRelationNode "_amod"
00389 ListLink
00390 SemeNode "ball@634a32ebc"
00391 SemeNode "red@a47343df"
00392
00393 It is presumed that, at some point, the above will be converted to:
00394
00395 EvaluationLink
00396 SemanticRelationNode "color_of@6543"
00397 ListLink
00398 SemeNode "ball@634a32ebc"
00399 SemeNode "red@a47343df"
00400
00401 This would need to work by recognition that "amod" together with "red"
00402 implies that "red" is a color. Could probably be done with a rule.
00403
00404 IF amod($X,$Y) ^ is-a($X, object) ^ is-a($Y, color)
00405 THEN color_of($X,$Y) ^ &delete_link(amod($X,$Y))
00406
00407 How do we bootstrap to there? Via upper-ontology-like statements:
00408 "Red is a color" and "A ball is an object". At some later, more
00409 abstract stage, one must ask: "Is a ball the kind of object that
00410 can have a color?"; but at first, we shall start naively, and
00411 assume that it is.
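
The IF ... THEN rule above, written out as a Python function over plain
tuples. This is only a sketch of the rewrite itself; in OpenCog it would
be a pattern or ImplicationLink, and the function name rewrite_amod and
the tuple layout are assumptions of this example.

```python
def rewrite_amod(relations, isa_facts):
    """Replace amod(X, Y) by color_of(X, Y) when X is an object and Y a color.

    relations: set of (name, x, y) tuples
    isa_facts: set of (thing, klass) tuples, e.g. ("red", "color")
    """
    out = set()
    for name, x, y in relations:
        if name == "amod" and (x, "object") in isa_facts and (y, "color") in isa_facts:
            out.add(("color_of", x, y))   # THEN clause; the amod link is dropped
        else:
            out.add((name, x, y))         # everything else is left untouched
    return out


# Bootstrapped from "Red is a color" and "A ball is an object".
isa_facts = {("red", "color"), ("ball", "object")}
print(rewrite_amod({("amod", "ball", "red"), ("amod", "ball", "big")}, isa_facts))
```

Only the amod link whose modifier is known to be a color is rewritten;
"big" stays an amod until something is learned about it.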
00412
00413
00414 Seme Promotion
00415 --------------
00416 The current code is organized around the idea of "seme promotion":
00417 snippets of scheme code that, given a word instance, return a seme.
00418 Currently, three different promoters are implemented.
00419
00420 -- trivial-promoter --
00421 Creates a new, unique SemeNode for *every* input WordInstanceNode.
00422 That is, no two words are ever assumed to refer to the same seme.
00423
00424 -- same-lemma-promoter --
00425 Creates a new SemeNode only if there isn't one already having the
00426 same lemma as the word instance. That is, it assumes that any given
00427 word always refers to the same seme. In a certain way, this is the
00428 "opposite" behaviour from the trivial promoter.
00429
00430 -- same-dependency-promoter --
00431 Re-uses an existing seme only if it has a superset of the dependency
00432 relations of the word-instance. Otherwise, it creates a new seme.
00433
00434 The motivation for, and operation of this last is discussed below. It
00435 has a number of subtle points, including problems with representing
00436 hypothetical (truth-query) questions.
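
The first two promoters are simple enough to sketch in a few lines of
Python. This is not the scheme code from this directory: semes are
reduced to name strings, the dict semes_by_lemma stands in for a lookup
through LemmaLinks, and all function names are invented here.

```python
import itertools

_ids = itertools.count(1)
semes_by_lemma = {}      # lemma -> seme name; stands in for a LemmaLink lookup


def new_seme(lemma):
    """Mint a fresh, uniquely named seme for the given lemma."""
    return f"{lemma}@{next(_ids)}"


def trivial_promoter(lemma):
    """Always create a new seme: no two words ever share one."""
    return new_seme(lemma)


def same_lemma_promoter(lemma):
    """Re-use the seme already recorded for this lemma, if there is one."""
    if lemma not in semes_by_lemma:
        semes_by_lemma[lemma] = new_seme(lemma)
    return semes_by_lemma[lemma]


print(trivial_promoter("ball") != trivial_promoter("ball"))
print(same_lemma_promoter("ball") == same_lemma_promoter("ball"))
```

The two sit at opposite extremes: one never merges, the other always
merges on the lemma alone.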
00437
00438 Although the current seme-promotion code is implemented in scheme, a
00439 long-term goal is to re-implement this code in terms of patterns, or
00440 ImplicationLinks, so that all seme promotion could be done by using the
00441 pattern matcher (i.e. a forward/backward chainer). That is, we want to
00442 minimize/eliminate the use of scheme code (or C++ code or python... or
00443 any code at all), and represent all graph transformations as hypergraphs
00444 themselves.
00445
00446
00447 Reference Resolution, and "Decorations"
00448 ---------------------------------------
00449 Consider the statement and truth-query below:
00450
00451 "Ben violently threw the green ball."
00452 "Did Ben softly throw the green ball?"
00453
00454 The problem of reference resolution is that of determining whether the
00455 "Ben" in the question is the same "Ben" as in the statement. Likewise
00456 for the verb "throw", since maybe there was some *other* ball that Ben
00457 did throw softly.
00458
00459 The current code uses the notion of "decorations", and uses pattern
00460 matching against the decorations to determine whether different word
00461 instances might refer to the same seme.
00462
00463 Given some seme, its "decorations" are all of the relations that have
00464 that seme appearing in the head-position of the relation. So, for
00465 example, for the verb V == "throw", the word instance V is promoted
00466 to the "seme" V by decorating V with relations. In this example,
00467 V is "decorated" with _subj(V, Ben), _obj(V, ball), _advmod(V, violently).
00468
00469 Then later, when I see "Did Mike throw a ball?" the answer is no,
00470 because this V is decorated in a different way. That is, this instance
00471 of "throw" can't possibly be the same "throw" as in the earlier
00472 sentence, because of the different decorations.
00473
00474 "Did Ben throw a ball?" -- yes, because this V is decorated with a
00475 subset of _subj(V, Ben), _obj(V, ball), _advmod(V, violently).
00476
00477 "Did Ben softly throw the ball?" No -- this instance of "throw" is
00478 decorated differently -- it can't possibly be the same "seme" as the
00479 violent throw.
00480
00481 Perhaps there is some other "throw" in the system ... maybe there is
00482 another sentence --- "Ben threw a red ball softly" already in the
00483 system, in which case the "throw" seme does appear to be the same. And
00484 also, the "ball" word-instance does match "red ball" so the answer is
00485 "yes". And, as a bonus, we know which ball was referred to -- it had to
00486 be the red ball -- it's the only match.
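
The subset test on decorations can be sketched in Python as below. It is
a simplification: dependents are written as bare words, whereas the real
comparison would be between semes (so that "ball" can match "red ball"),
and the function name find_matching_seme is invented for this example.

```python
def find_matching_seme(instance_decorations, semes):
    """Return the first seme whose decorations include all of the instance's.

    semes: dict mapping seme name -> set of (relation, dependent) pairs.
    Returns None when no seme has a superset of the instance's decorations.
    """
    for name, decorations in semes.items():
        if instance_decorations <= decorations:
            return name
    return None


semes = {
    # "Ben violently threw the green ball."
    "throw@1": {("_subj", "Ben"), ("_obj", "ball"), ("_advmod", "violently")},
}

# "Did Ben throw a ball?" -- decorated with a subset, so it matches.
print(find_matching_seme({("_subj", "Ben"), ("_obj", "ball")}, semes))
# "Did Mike throw a ball?" -- different subject, no match.
print(find_matching_seme({("_subj", "Mike"), ("_obj", "ball")}, semes))
# "Did Ben softly throw the ball?" -- different adverb, no match.
print(find_matching_seme({("_subj", "Ben"), ("_obj", "ball"), ("_advmod", "softly")}, semes))
```

Adding a second seme for "Ben threw a red ball softly" would make the
last query succeed, as described above.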
00487
00488 In the above, all "decorations" were in the form of RelEx relations.
00489 At some point, these can be, and should be, replaced by bona-fide
00490 "frames". The need for this is already fairly clear, when one compares
00491 questions such as "Has Ben always been throwing balls?" to the
00492 syntactically similar "Has that tree always been standing there?".
00493 Here, "throwing" is a dynamic, transient activity, while "standing"
00494 is not. Decorating with RelEx relations is not enough to capture this
00495 difference. Any sort of more sophisticated logical deduction or
00496 inference will need access to such frame decorations.
00497
00498 An important point here is that "bona-fide" frame relations can have
00499 the same structure as the RelEx relations: that is, both can still be
00500 considered to be "decorations", and so pattern matching and other
00501 algorithms can benefit from this structural similarity.
00502
00503
00504 Promotion and decoration of questions
00505 -------------------------------------
00506 The act of seme promotion, as described above, performs question
00507 answering more-or-less as a side-effect. That is, the result of seme
00508 promotion on the sentence "Ben threw a ball" and the question "Did Ben
00509 throw a ball?" is a match on "Ben" and "ball", leaving only the verb
00510 to be compared. The primary algorithmic concern is to avoid accidentally
00511 promoting the question into a statement, i.e. mapping both verbs to the
00512 same seme.
00513
00514 RelEx decorates the verb "throw" in the question with
00515
00516 TRUTH-QUERY-FLAG(throw, T)
00517 HYP(throw, T)
00518
00519 which is rendered as
00520
00521 ; TRUTH-QUERY-FLAG (throw, T)
00522 (InheritanceLink (stv 1.0 1.0)
00523 (WordInstanceNode "throw@dcae1b05-54cf-4f8a-b650-cd7327fe6fb6")
00524 (DefinedLinguisticConceptNode "truth-query")
00525 )
00526 ; HYP (throw, T)
00527 (InheritanceLink (stv 1.0 1.0)
00528 (WordInstanceNode "throw@dcae1b05-54cf-4f8a-b650-cd7327fe6fb6")
00529 (DefinedLinguisticConceptNode "hyp")
00530 )
00531
00532 Thus, the following verb promotion rules are needed:
00533
00534 1) If a seme is decorated with HYP or TRUTH-QUERY, then a word instance
00535 without these decorations must not be promoted to that seme. This
00536 avoids the problem that would occur if the question "Did Ben throw a
00537 rock?" was followed by the question "What did Ben throw?". The second
00538 "throw" does not have either HYP or TRUTH-QUERY, and so if promotion
00539 was allowed, it would find "rock" as an answer.
00540
00541 2) If a seme is NOT decorated with HYP or TRUTH-QUERY, then a word
00542 instance that does have these decorations must not be promoted to
00543 that seme. This minimizes any chances that parts of the question
00544 might get promoted into statements, or that a statement might be
00545 reverted into a question.
00546
00547 3) If a word-instance is decorated with HYP or TRUTH-QUERY, then, when
00548 a seme is first being created, these decorations must be attached to
00549 the seme.
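
The three rules reduce to a small compatibility check. The Python sketch
below represents decorations as sets of lower-case flag names (matching
the "hyp" and "truth-query" concept names shown above); the function
names may_promote and flags_for_new_seme are invented for this example.

```python
QUESTION_FLAGS = {"hyp", "truth-query"}


def may_promote(instance_flags, seme_flags):
    """Rules 1 and 2: a word instance and a seme must agree on question-ness."""
    instance_is_question = bool(instance_flags & QUESTION_FLAGS)
    seme_is_question = bool(seme_flags & QUESTION_FLAGS)
    return instance_is_question == seme_is_question


def flags_for_new_seme(instance_flags):
    """Rule 3: a newly created seme inherits the instance's question flags."""
    return instance_flags & QUESTION_FLAGS


# "What did Ben throw?" must not be promoted to the seme of "Did Ben throw a rock?"
print(may_promote(set(), {"hyp", "truth-query"}))
# A question's verb must not be promoted to the seme of a plain statement.
print(may_promote({"hyp", "truth-query"}, set()))
```

A promoter would call may_promote on each candidate seme before
comparing any other decorations.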
00550
00551
00552
00553
00554 Contextual representation
00555 -------------------------
00556 A given SemeNode can be relevant to one or more contexts. The
00557 relationship is indicated with a link to a named context. Say, for
00558 example, that during an IRC chat, JaredW stated that "The ball
00559 is red". We'd then have a
00560
00561 ContextLink
00562 ContextNode "# IRC:JaredW"
00563 SemeNode "ball@634a32ebc"
00564
00565 At this time, the creation/naming of ContextNodes would be ad-hoc, on
00566 a case-by-case basis. All input from the MIT ConceptNet project would
00567 be marked with something like
00568
00569 ContextNode "# MIT ConceptNet dump 20080605"
00570
00571 A SemeNode, once determined to be sufficiently general, might belong
00572 to several contexts. There might be a hierarchy of ContextNode
00573 inclusions: so, for example, concepts in "common-sense" contexts, such
00574 as ConceptNet, would be judged to be sufficiently universal to also hold
00575 in IRC contexts, Project Gutenberg contexts, Wikipedia contexts, etc.
00576
00577 The only reason for using ContextLinks instead of
00578
00579 EvaluationLink
00580 DefinedRelationshipNode "Context"
00581 ListLink
00582 ContextNode "# IRC:JaredW"
00583 SemeNode "ball@634a32ebc"
00584
00585 is to save a bit of RAM storage; there will be at least one ContextLink
00586 for every SemeNode.
00587
00588
00589 Bootstrapping
00590 -------------
00591 -- Read sentence from some source.
00592 -- Process sentence for triples
00593 -- Create initial SemeNodes that match key word instances.
00594 -- Add (temporary?) ContextLinks to indicate source.
00595 -- Scan existing SemeNodes for possible match.
00596 -- Merge SemeNodes if a plausible match is found. ("clustering")
00597
00598
00599 Current Implementation Status
00600 -----------------------------
00601 The current implementation does just about none of the whiz-bang stuff
00602 discussed above. So far, just the most basic scaffolding has been set
00603 up.
00604
00605 All "seme promoters" function by accepting a word-instance, and
00606 returning a corresponding seme. The various promoters are of
00607 different levels of sophistication, and use different algorithms.
00608 They all create an InheritanceLink relating the original word instance
00609 to the seme, like so:
00610
00611 InheritanceLink
00612 WordInstanceNode hello@123
00613 SemeNode greeting@789
00614
00615 Most/all of these also create a link to the lemmatized word form
00616 i.e. "the English-language word" that corresponds to the seme:
00617
00618 LemmaLink
00619 SemeNode greeting@789
00620 WordNode hello
00621
00622 All of these promoters could be, and eventually should be
00623 re-implemented as ImplicationLinks, so that all promotion runs
00624 entirely within OpenCog. For now, they are implemented in scheme.
00625
00626
00627
00628 Deduction
00629 ---------
00630 Suppose we have a set of (consistent) prepositional relationships. What
00631 can we do with them? For example, can we deduce that a certain verb is
00632 a type of locomotion, based on its use with regard to prepositions?
00633
00634 Hmm. Time to write some rules, and experiment and see what happens.
00635 Not clear how unambiguous the copulas and preps will be.
00636
00637 ToDo:
00638 -----
00639 Create the following new atom types:
00640 ContextNode
00641 SemanticRelationshipNode
00642
00643 todo -- add POS tagging!!
00644
00645
00646 ToDo: Reification of triples ...
00647 Re-examine Markov Logic Networks ...
00648 Phrasal verbs vs. prepositional phrases
00649
00650
00651 Notes
00652 -----
00653 Initially 51 secs to load 3K sentences
00654 However, if sqldb is open, this explodes to 5 minutes! (I guess
00655 because an sql query is made for each new atom added to the atomspace).
00656 (err, well my system is 100% cpu even before running this...)
00657
00658 8 27 16 secs per sentence ...
00659
00660 bugs: $prep maps to Ss, also to _subj ...
00661
00662 Took 995 minutes to process 3870 sentences, used 466 MBytes
00663 Took 288 minutes to process 3000 sentences, used 284 MBytes
00664 From 3000 sentences, got 6978 triples
00665 But only 1475 unique triples on 1475 semes
00666
00667
00668 References:
00669 -----------
00670 [FEAT] Feature extraction. See
00671 http:
00672 http:
00673
00674
00675 Alexander Yates and Oren Etzioni.
00676 [http:
00677 Unsupervised Methods for Determining Object and Relation Synonyms on
00678 the Web]. Journal of Artificial Intelligence Research 34, March,
00679 2009, pages 255-296.
00680
00681 Fabian M. Suchanek, Mauro Sozio, Gerhard Weikum
00682 SOFIE: A Self-Organizing Framework for Information Extraction
00683 WWW 2009 Madrid!
00684