README

00001 
00002                       Semes - Concept Formation
00003                       -------------------------
00004                        Linas Vepstas May 2009
00005                        Revised September 2009
00006 
00007 
00008 This directory contains some notes and experimental code for forming
00009 and extracting conceptual entities or "semes" from English text.  An
00010 example of a "conceptual entity" would be the "Great Southern Railroad":
00011 a business, a railway, that existed at a certain point in space and time.
00012 Seme formation overlaps and combines two related tasks: "named entity
00013 extraction" and "reference resolution". The first, "named entity
00014 extraction", attempts to identify named objects. "Reference resolution"
00015 attempts to determine when two different words (usually in different
00016 sentences) refer to the same thing.
00017 
00018 The primary challenges of seme extraction are:
00019 1) Constructing a data representation that is amenable to reasoning,
00020    and to question answering. 
00021 2) Recognizing when two different semes refer to the same concept
00022    (so that they can be merged).
00023 3) Recognizing when one seme (incorrectly) refers to distinct concepts,
00024    and should be split apart.
00025 4) Learning new conceptual classifications, ontologies and relations;
00026    so, for example, when encountering a new, unknown, word, to determine
00027    that it is, for example, a previously unknown color name.
00028 
00029 The text below is split into three parts: A motivational overview of
00030 "semes", a discussion of various issues that arise in concept formation,
00031 and, finally, details of the data structures and algorithms currently
00032 implemented in this directory.
00033 
00034 
00035 Part I -- Motivation -- Why semes?
00036 ==================================
00037 The notion of a seme is introduced here to solve a "simple" technical
00038 problem in knowledge representation.  The goal of using "semes" is to
00039 move away from using words and/or word-instances to represent "things".
00040 Thus, "Mary's shoes" and "Tom's shoes" would be two different semes,
00041 because they are two different "things" (although both things are
00042 related to the word "shoe").  
00043 
00044 By contrast, "Mary had some shoes. The shoes were red."  has two
00045 distinct word-instances: the word-instance "shoe" in the first sentence,
00046 and the word instance "shoe" in the second sentence.  However, these
00047 two words stand for the same concept: "Mary's red shoes". The goal of
00048 using semes is to collapse these two word-instances into one seme,
00049 without getting tangled with the fact that these two different 
00050 word-instances of "shoe" are involved in different syntactic relations
00051 in different sentences (... sentences that were possibly uttered by 
00052 different individuals at different times, even!)
00053 
00054 Thus, "semes", as defined here, are meant to be an abstraction that 
00055 behaves much like "concepts", yet, in a certain way, also behaves much
00056 like "words".  This leaves the notion of "concepts" free for other,
00057 more abstract usage. Semes are meant to be fairly closely related to
00058 "words": they are only one small step towards the general goal of 
00059 "conceptualization". In particular, semes are meant to be sufficiently
00060 word-like that they can be used in most relations that words are used
00061 in.  So, for example, if there's a RelEx relation that connects two
00062 words, then one could have exactly the same structure connecting two
00063 semes. 
00064 
00065 Thus, "semes" are removed by only a small step from linguistic usage;
00066 they provide a needed abstraction on the road to true "concepts" and
00067 are just flexible enough to support basic tasks, such as (basic) entity 
00068 identification,  reference resolution, and (basic) reasoning.
00069 
00070 
00071 What is a Concept?
00072 ------------------
00073 So far, almost all the processing described in nlp/triples/README 
00074 has been in terms of graph modifications performed on individual
00075 sentences, containing WordInstanceNodes, and links to WordNodes.
00076 In order to promote text input into concepts, and to reason with 
00077 concepts, we need to define what a concept is, and where its boundaries
00078 extend to.  At this point, the goal is not to define a super-abstract
00079 notion of a concept that passes all epistemological tests, but rather 
00080 a practical, if flawed, data structure that is adequate for representing
00081 data learned by reading, learned through linguistic corpus analysis. 
00082 The emphasis here is "flawed but practical": it should be just enough
00083 to take us to the next level of abstraction.
00084 
00085 Naively, a concept would seem to have the following parts:
00086 -- a SemeNode, to serve as an anchor (could be ConceptNode)
00087 -- a linguistic expression complex. 
00088 -- a WordNet sense tag (optional)
00089 -- a DBPedia URI tag (optional)
00090 -- an OpenCyc tag ... etc. you get the idea.
00091 -- basic ontological links -- is-a, has-a, part-of, etc.
00092 -- prepositional relations (next-to, inside-of, etc)
00093 -- Context tag(s). See section "Context" below.
00094 
00095 What is a "linguistic expression complex"?  It deals with the idea that
00096 most concepts are not expressible as single words: for example, "Mary
00097 had a red balloon".  The head concept here is "balloon": it is an instance
00098 of the class of all balloons, and specifically, this instance is red. 
00099 Thus, a "linguistic expression complex" would consist of:
00100 
00101 -- a head WordNode, to give a single, leading name.
00102    Possibly several WordNodes to give it multiple names?
00103 -- dependent modifier tags (e.g. "red")
00104 -- a part-of-speech tag, to provide a rough linguistic categorization
00105 -- a collection of disjunct tags, representing possible linguistic
00106    use of the WordNode to represent this concept.
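
The two lists above can be sketched as a pair of record types. This is
only an illustration in plain Python, not OpenCog code: the class names
ExpressionComplex and Seme, and all of their field names, are assumptions
made up for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class ExpressionComplex:
    """Hypothetical container for a "linguistic expression complex"."""
    head_words: List[str]                               # one or more WordNode names
    modifiers: Set[str] = field(default_factory=set)    # dependent modifier tags
    pos: Optional[str] = None                           # rough part-of-speech tag
    disjuncts: Set[str] = field(default_factory=set)    # parser disjunct tags


@dataclass
class Seme:
    """Hypothetical record mirroring the parts of a concept listed above."""
    name: str                                           # anchor name
    expression: ExpressionComplex
    wordnet_sense: Optional[str] = None                 # optional sense tag
    dbpedia_uri: Optional[str] = None                   # optional URI tag
    relations: Set[Tuple[str, str]] = field(default_factory=set)  # (relation, target)
    contexts: Set[str] = field(default_factory=set)     # context tags


# "Mary had a red balloon": the head word is "balloon", modified by "red".
balloon = Seme(
    name="balloon@1",
    expression=ExpressionComplex(head_words=["balloon"],
                                 modifiers={"red"}, pos="noun"),
)
```

The optional tags default to "absent", which matches the idea that a seme
starts out as little more than a head word plus its modifiers.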
00107 
00108 
00109 Promoting Words to Concepts
00110 ---------------------------
00111 Consider the task of promoting word-relations to concepts.  Consider the 
00112 following relationships:
00113 
00114    is_a(bark, sound)
00115    part_of(bark, tree)
00116 
00117 We know that these two relations refer to different senses of the word
00118 "bark". Yet, if these two are deduced by reading, how should the system
00119 recognize that two different concepts are at play?  How should the
00120 self-consistency of a set of relations be assessed? Assuming that the
00121 input text is not intentionally lying, then, under what circumstances
00122 does a set of conflicting assertions require that the underlying word be
00123 recognized as embodying two different concepts?
00124 
00125 One possible approach is to assign tentative WordNet-based word-senses 
00126 using either the Mihalcea algorithm, or table-lookup from syntax-tagged
00127 senses (see the wsd-post/README for details). One nice aspect of WordNet
00128 tagging is that the built-in WordNet ontology can be used to double-check
00129 and strengthen certain sense assignments. Thus, for example:
00130 
00131     bark%1:20:00:: has part-holonym tree_trunk%1:20:00::
00132 
00133 while
00134 
00135     bark%1:11:02:: has direct hypernym noise%1:11:00::
00136     and inherited hypernym sound%1:11:00::
00137 
00138 Thus, triples that have been read in, and tagged with WordNet senses,
00139 can be verified against the WordNet ontology for the correctness of
00140 sense assignments.  While this is a reasonable starting point, and
00141 gives an easy leg-up, it does not solve the more general problem of
00142 distinguishing and refining concepts.
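
A rough sketch of this verification step follows. It is plain Python
working on a tiny hand-copied table holding only the WordNet links quoted
above; a real system would query WordNet itself. The function names
lemma and senses_supporting are invented for this illustration.

```python
# Fragment of WordNet, copied from the example above (not a full database).
HYPERNYMS = {
    "bark%1:11:02::": ["noise%1:11:00::", "sound%1:11:00::"],
}
PART_HOLONYMS = {
    "bark%1:20:00::": ["tree_trunk%1:20:00::"],
}


def lemma(sense_key):
    """Strip the sense suffix: "bark%1:11:02::" -> "bark"."""
    return sense_key.split("%", 1)[0]


def senses_supporting(relation, word, target):
    """Return the sense keys of `word` whose links support relation(word, target)."""
    table = {"is_a": HYPERNYMS, "part_of": PART_HOLONYMS}[relation]
    return [
        key for key, linked in table.items()
        if lemma(key) == word and any(lemma(t) == target for t in linked)
    ]
```

With this table, is_a(bark, sound) is supported only by the sense
bark%1:11:02::, and part_of(bark, tree_trunk) only by bark%1:20:00::, so
the two triples end up attached to different senses.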
00143 
00144 Another approach is to use part-of-speech tags, and disjunct tags, as
00145 stand-ins for word senses.  That is, the parser has already identified
00146 different word-instances according to their part-of-speech, and so at
00147 least a rough word-sense classification is available from that.  That
00148 is, it is safe to assume that a noun and a verb never represent the 
00149 same concept (at a certain level...).  It has also been seen (see 
00150 wsd-post/README) that the disjunct used during parsing has a high 
00151 correlation with the word-sense; the disjunct used during parsing 
00152 can be considered to be a very fine-grained part-of-speech tag. Thus,
00153 instead of using WordNet sense tags as concept "nucleation centers",
00154 the disjuncts could be used as such.
00155 
00156 Two distinct processes are at play: 1) recognizing that two different
00157 word instances refer to the same concept, and 2) recognizing that a
00158 previously learned concept should be refined into two distinct concepts.
00159 (For example, having learned the properties of a "pencil", one must 
00160 recognize at some point that a "mechanical pencil" and a "wooden pencil"
00161 have many incompatible properties, and thus the notion of a pencil must
00162 be split into these two new concepts).
00163 
00164 The most direct route to either of these processes is by means of
00165 "consistency checking": using forward and backward chaining to determine
00166 whether two distinct statements are compatible with each other. When
00167 they are, then the two different word-instances can be assumed to refer
00168 to the same concept; relationships can then be merged.
00169 
00170 Part II -- Issues 
00171 =================
00172 A short discussion of issues that arise in concept/seme formation.
00173 
00174 Context
00175 -------
00176 Almost all facts are contextual. You can't just say "John has a red
00177 ball" and promote that to a fact. You must presume a context of some
00178 sort: "Someone said during an IRC chat that John has a red ball", or,
00179 "While reading Emily Bronte, I learned that John had a red ball."  The
00180 context is needed for two reasons:
00181 
00182 1) When obtaining additional info within the same context, it is 
00183 simpler/safer to deduce references, e.g. that the John in the second
00184 sentence is the same John as in the first sentence.
00185 
00186 2) When obtaining additional info within a different context, it is
00187 simpler/safer to assume that references are distinct: that, for example,
00188 "John" in an Emily Dickinson novel is not the same "John" in an Emily
00189 Bronte novel.
00190 
00191 Thus, it makes sense to tag recently formed SemeNodes with a context tag.
00192 
00193 
00194 A priori vs. Deduced Knowledge
00195 ------------------------------
00196 Consider the following:
00197 
00198    capital_of(Germany, Berlin)
00199 
00200 This triple references a lot of a-priori knowledge.  We know that
00201 capitals are cities; thus there is a strong temptation to write a
00202 processing rule such as "IF ($var0,capital) THEN ($var0,city)".
00203 Similarly, one has a-priori knowledge that things which have capitals
00204 are political states, and so one is tempted to write a rule asserting
00205 this: "IF (capital_of($var0, $var1)) THEN political_state($var0)".
00206 
00207 A current working assumption of what follows is that the various rules
00208 will/should encode a minimum of a-priori "real-world" knowledge.
00209 Instead, the goal here is to create a system that can learn, deduce
00210 such "real-world" knowledge.
00211 
00212 
00213 Definite vs. Indefinite
00214 -----------------------
00215 There is a subtle semantic difference between triples that describe
00216 definite properties, vs. triples that describe generic properties, 
00217 or semantic classes.  Thus, for example, "color_of(sky,blue)" seems 
00218 unambiguous: this is because we know that the sky can only ever have
00219 one color (well, unless you are looking at a sunset). Consider 
00220 "form_of(clothing, skirt)": this asserts that a skirt is a form_of 
00221 clothing, and not that clothing is always a skirt. The form_of 
00222 indicates a semantic category.  Similarly, "group_of(people, family)"
00223 asserts that a family is a group_of people, and not that groups of
00224 people are families.
00225 
00226 The distinction here seems to be whether or not the modifier was
00227 definite or indefinite: "THE color of ...." vs. "A form of.." or
00228 "A group of..."
00229 
00230 XXX This is a real bug/hang-up in the triples processing code:
00231 being unaware of this distinction seems to cause some triples
00232 to come out "backwards" (i.e. that clothing is always a skirt).
00233 Caution to be used during seme formation! XXX
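
One way to keep track of the distinction is to carry the determiner along
with the triple. The sketch below is plain Python and purely illustrative:
the function interpret, its arguments and its return labels are all
assumptions, and nothing like it exists in the triples code.

```python
def interpret(relation, owner, value, determiner):
    """Classify the triple relation_of(owner, value) by the determiner
    that preceded the relation word in the source sentence."""
    if determiner == "the":
        # "THE color of the sky is blue": value is a property of owner.
        return ("property", f"the {relation} of {owner} is {value}")
    if determiner in ("a", "an"):
        # "a skirt is A form of clothing": value is one member of a class.
        return ("category", f"{value} is a {relation} of {owner}")
    return ("unknown", None)
```

For example, interpret("form", "clothing", "skirt", "a") yields the
reading "skirt is a form of clothing", rather than the backwards reading
that clothing is always a skirt.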
00234 
00235 
00236 Learning Semantic Categories
00237 ----------------------------
00238 Consider the category of "types of motion". Currently, the RelEx frame
00239 rules include an explicit list of category members:  
00240 
00241    $Self_motion
00242    amble
00243    bustle
00244    canter
00245    clamber
00246    climb
00247    clomp
00248    coast
00249    crawl
00250    creep
00251 
00252 This list clearly encodes a-priori knowledge about locomotion.  It would
00253 be better if the members of this category could be deduced by reading.
00254 There are three ways in which this might be done. One might someday
00255 read a sentence that asserts "Crawling is a type of locomotion".  This
00256 seems unlikely, as this is common-sense knowledge, and common-sense
00257 knowledge is not normally encoded in text. A second possibility is to
00258 learn the meaning of the word "crawl" the way that children learn it: 
00259 to have someone point at a centipede and say "gee, look at that thing
00260 crawl!"  Such experiential, cross-sensory learning would indeed be an
00261 excellent way to gain new knowledge. However, there are two snags: 
00262 1) It presumes the existence of a teacher who already knows how to use
00263 the word "crawl", and 2) It is outside of the scope of what one person
00264 (i.e. me) can achieve in a limited amount of time.  A third possibility
00265 is statistical learning: to observe a large number of statements
00266 containing the word "crawl", and, based on these, deduce that it is a
00267 type of locomotion.
00268 
00269 In the following, the third approach is presumed. This is because the
00270 author has in hand both the statistical and the linguistic tools that
00271 would allow such observation and deduction to be made.
00272 
00273 
00274 Consistency Checking
00275 --------------------
00276 Consider the following three sentences:
00277    Aristotle is a man.
00278    Men are mortal.
00279    Aristotle is mortal.
00280 
00281 Or:
00282 
00283    Berlin is the capital of Germany.
00284    Capitals are cities.
00285    Berlin is a city.
00286 
00287 Assume the first two sentences were previously determined to be true,
00288 with a high confidence value. How can we determine that the third 
00289 sentence is plausible, i.e. consistent with the first two sentences?
00290 
00291 Upon reading the third sentence, it could be turned into a hypothetical
00292 statement, and suggested as the target of the PLN backward chainer. If
00293 the chainer is able to deduce that it is true, then the confidence of
00294 all three statements can increase: they form a set of mutually
00295 self-supporting statements.
00296 
00297 So, for example, the above generate:
00298    capital_of(Germany,Berlin)
00299    isa(city, capital)
00300    isa(city, Berlin)
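
The check can be sketched with a toy backward chainer. This is plain
Python, not the PLN chainer: it knows only the transitivity of isa, it
ignores confidence values, and it assumes that isa(capital, Berlin) has
already been obtained from capital_of(Germany, Berlin). Pairs are written
(class, member), following the isa(city, Berlin) convention used above.

```python
def provable(goal, facts, seen=None):
    """Backward-chain over isa(cls, member) pairs using transitivity only."""
    seen = set() if seen is None else seen
    if goal in facts:
        return True
    if goal in seen:            # already tried; avoid looping forever
        return False
    seen.add(goal)
    cls, member = goal
    # To prove isa(cls, member), find some mid with isa(mid, member),
    # then try to prove isa(cls, mid).
    return any(
        provable((cls, mid), facts, seen)
        for (mid, m) in facts
        if m == member
    )


facts = {
    ("capital", "Berlin"),      # assumed to follow from capital_of(Germany, Berlin)
    ("city", "capital"),        # "Capitals are cities."
}
```

Here provable(("city", "Berlin"), facts) succeeds, so "Berlin is a city"
is consistent with the two earlier statements.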
00301 
00302 The prepositional construction XXX_of(A,B) allows the deduction that
00303 isa(XXX,B) (a deduction which can be made directly from the raw sentence
00304 input, and does not need to be processed from the prepositional form).
00305 (Right??) Certainly this is true for kind_of and capital_of; is this
00306 true for all prepositional uses of "of"?
00307 
00308 Normally, a country can have only one capital; thus we need an exclusion 
00309 rule:
00310 
00311    if capital_of(X,Y) and different(Y,Z) then not capital_of(X,Z)
00312 
00313 There are potentially lots of such unique relations, so the above should
00314 be formulated as
00315 
00316    if R(X,Y) and uniq_grnd_relation(R) and different(Y,Z) then not R(X,Z)
00317 
00318 Thus, we have a class of uniquely-grounded relations, of which capital_of
00319 is one.  Part of the learning process is to somehow discover rules of the
00320 above form.
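
As a sketch, the generalized exclusion rule amounts to a lookup against
the triples already known. The code is plain Python; the set
UNIQUELY_GROUNDED is filled in by hand here for illustration, whereas the
point of the paragraph above is that its membership should be learned.

```python
UNIQUELY_GROUNDED = {"capital_of"}   # assumed membership, for illustration


def conflicts(new_triple, known_triples):
    """Return the known triples that the exclusion rule says contradict
    new_triple, i.e. R(X,Y) with R uniquely grounded and Y different from Z."""
    rel, x, z = new_triple
    if rel not in UNIQUELY_GROUNDED:
        return []
    return [
        (r, a, y) for (r, a, y) in known_triples
        if r == rel and a == x and y != z
    ]


known = [
    ("capital_of", "Germany", "Berlin"),
    ("next_to", "Germany", "France"),
]
```

Asserting capital_of(Germany, Bonn) against this list reports a conflict
with capital_of(Germany, Berlin), while next_to(Germany, Poland) reports
none, since next_to is not uniquely grounded.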
00321 
00322 
00323 Using triples for input
00324 -----------------------
00325 Other problems: Consider the sentence:
00326 "A hospital is a place where you go when you are sick."
00327 
00328 One may deduce that "A hospital is a place", but one must be careful
00329 in making use of such knowledge....
00330 
00331 
00332 Pseudo-clustering
00333 -----------------
00334 A key step in concept formation is determining if/when two distinct
00335 instances are really the same concept. This is to be accomplished by 
00336 comparing two concepts, and returning a (simple) truth value indicating
00337 the likelihood that they are the same.  Many algos are possible.  The
00338 simplest might be the following:
00339 
00340 Take a weighted average of link-comparisons, comparing:
00341 -- WordNode. A mismatch here means that it is highly unlikely that the
00342    concepts are identical, unless the WordNode is a pronoun.
00343 -- ContextNode. A mismatch here means it's highly unlikely that the 
00344    concepts are identical, unless the ContextNode is one of the base
00345    "common-sense" contexts.
00346 -- Compare modifiers. A modifier present in one, but absent in the 
00347    other, is "neutral". Conflicting modifiers suggest a conceptual
00348    mis-match: If the current sentence calls a ball "green", while 
00349    a previous one called it "red", then the two references are probably
00350    to two different balls. Ditto for big, small, light, heavy, etc.
00351 -- Compare relations, e.g. capital_of, next_to, etc. Much like comparing
00352    modifiers.
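
A minimal sketch of such a weighted average is given below, in plain
Python. Everything concrete in it is an assumption: the weights, the
pronoun list, the groups of mutually exclusive modifiers, and the
dictionary layout used for a seme. Relation comparison is left out; it
would be scored much like the modifiers.

```python
PRONOUNS = {"he", "she", "it", "they"}
BASE_CONTEXTS = {"common-sense"}
# Mutually exclusive modifier groups; membership is illustrative only.
EXCLUSIVE = [{"red", "green", "blue"}, {"big", "small"}, {"light", "heavy"}]


def modifier_score(mods_a, mods_b):
    """0.0 on a conflict inside an exclusive group; otherwise neutral 0.5,
    raised to 1.0 when the two share at least one modifier."""
    for group in EXCLUSIVE:
        a, b = mods_a & group, mods_b & group
        if a and b and a != b:
            return 0.0
    return 1.0 if mods_a & mods_b else 0.5


def same_seme_likelihood(a, b, weights=(0.4, 0.2, 0.4)):
    """Weighted average of the word, context and modifier comparisons."""
    word_ok = a["word"] == b["word"] or {a["word"], b["word"]} & PRONOUNS
    ctx_ok = (a["context"] == b["context"]
              or {a["context"], b["context"]} & BASE_CONTEXTS)
    scores = (
        1.0 if word_ok else 0.0,
        1.0 if ctx_ok else 0.0,
        modifier_score(a["modifiers"], b["modifiers"]),
    )
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


red_ball = {"word": "ball", "context": "chat", "modifiers": {"red"}}
green_ball = {"word": "ball", "context": "chat", "modifiers": {"green"}}
plain_ball = {"word": "ball", "context": "chat", "modifiers": set()}
```

With these weights, a "red ball" and an unmodified "ball" score higher
than a "red ball" and a "green ball", as the list above asks for.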
00353 
00354 
00355 
00356 Part III -- Implementation
00357 ==========================
00358 
00359 Concrete Data Representation
00360 ----------------------------
00361 Let's now look at how to represent some of these ideas concretely, in
00362 terms of OpenCog hypergraphs.
00363 
00364 First, a SemeNode will be used as the main anchor point. A SemeNode is
00365 used, instead of a ConceptNode, so as to leave ConceptNode open for 
00366 other uses; the goal here is to minimize confusion/cross-talk between
00367 this and other parts of OpenCog.
00368 
00369 Initially, when first creating a SemeNode, it should probably be given
00370 a name that is a copy of the WordInstanceNode that inspired it: "John 
00371 threw a red ball" leads to 
00372 
00373     SemeNode "ball@634a32ebc"
00374 
00375 A basic name is needed for the concept, and so, in complete analogy
00376 to the relation between WordInstanceNodes and WordNodes, we create:
00377 
00378    LemmaLink
00379        SemeNode "ball@634a32ebc"
00380        WordNode "ball"
00381 
00382 The LemmaLink is used to indicate the root form of the word, stripped
00383 of inflection, number, tense, etc.  The idea that it's red is indicated
00384 by using modifiers, the *same* modifiers as RelEx uses, with essentially
00385 the same meanings:
00386 
00387     EvaluationLink
00388        DefinedLinguisticRelationNode "_amod"
00389        ListLink
00390           SemeNode "ball@634a32ebc"
00391           SemeNode "red@a47343df"
00392 
00393 It is presumed that, at some point, the above will be converted to:
00394 
00395     EvaluationLink
00396        SemanticRelationNode "color_of@6543"
00397        ListLink
00398           SemeNode "ball@634a32ebc"
00399           SemeNode "red@a47343df"
00400 
00401 This would need to work by recognition that "amod" together with "red"
00402 implies that "red" is a color. Could probably be done with a rule. 
00403 
00404    IF amod($X,$Y) ^ is-a($X, object) ^ is-a($Y, color)
00405    THEN color_of($X,$Y) ^ &delete_link(amod($X,$Y))
00406 
00407 How do we bootstrap to there? Via upper-ontology-like statements:
00408 "Red is a color" and "A ball is an object".   At some later, more
00409 abstract stage, one must ask: "Is a ball the kind of object that
00410 can have a color?"; but at first, we shall start naively, and 
00411 assume that it is.
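
The rule above can be tried out on plain tuples before it is written as a
hypergraph pattern. The sketch below is plain Python with an invented
function name, apply_color_rule; it works on lemma strings rather than on
SemeNodes, and it takes the naive view that any object can have a color.

```python
def apply_color_rule(relations, isa):
    """Rewrite amod(X, Y) into color_of(X, Y) when X is an object and Y is
    a color.

    `relations` is a set of (name, x, y) triples and `isa` is a set of
    (thing, class) pairs.  A new set is returned; the input is not changed.
    """
    out = set()
    for name, x, y in relations:
        if name == "amod" and (x, "object") in isa and (y, "color") in isa:
            out.add(("color_of", x, y))      # the amod link is dropped
        else:
            out.add((name, x, y))
    return out


# Bootstrap knowledge: "A ball is an object" and "Red is a color".
isa = {("ball", "object"), ("red", "color")}
relations = {("amod", "ball", "red"), ("amod", "ball", "big")}
```

Applied to these relations, amod(ball, red) becomes color_of(ball, red),
while amod(ball, big) is left alone because "big" is not known to be a
color.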
00412 
00413 
00414 Seme Promotion
00415 --------------
00416 The current code is organized around the idea of "seme promotion":
00417 snippets of scheme code that, given a word instance, return a seme.
00418 Currently, three different promoters are implemented.
00419 
00420 -- trivial-promoter -- 
00421    Creates a new, unique SemeNode for *every* input WordInstanceNode. 
00422    That is, no two words are ever assumed to refer to the same seme.
00423 
00424 -- same-lemma-promoter -- 
00425    Creates a new SemeNode only if there isn't one already having the 
00426    same lemma as the word instance.  That is, it assumes that any given 
00427    word always refers to the same seme. In a certain way, this is the
00428    *opposite* behaviour from the trivial promoter.
00429 
00430 -- same-dependency-promoter --
00431    Re-uses an existing seme only if it has a superset of the dependency
00432    relations of the word-instance. Otherwise, it creates a new seme.
00433 
00434 The motivation for, and operation of this last is discussed below. It
00435 has a number of subtle points, including problems with representing
00436 hypothetical (truth-query) questions.
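
The first two promoters are simple enough to sketch in a few lines. This
is plain Python, not the scheme code in this directory: the class
SemeStore is a made-up stand-in for the atomspace, and the promoters take
a lemma string instead of a WordInstanceNode.

```python
import itertools


class SemeStore:
    """Hypothetical in-memory stand-in for the atomspace."""

    def __init__(self):
        self.lemma_of = {}                   # seme name -> lemma
        self._ids = itertools.count(1)

    def _new_seme(self, lemma):
        name = f"{lemma}@{next(self._ids)}"
        self.lemma_of[name] = lemma
        return name

    def trivial_promoter(self, lemma):
        # Every word instance gets its own, brand-new seme.
        return self._new_seme(lemma)

    def same_lemma_promoter(self, lemma):
        # Re-use the first seme already carrying this lemma, if any.
        for name, seme_lemma in self.lemma_of.items():
            if seme_lemma == lemma:
                return name
        return self._new_seme(lemma)
```

Calling trivial_promoter("ball") twice gives two different semes, while
same_lemma_promoter("ball") keeps returning the same one.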
00437 
00438 Although the current seme-promotion code is implemented in scheme, a 
00439 long-term goal is to re-implement this code in terms of patterns, or
00440 ImplicationLinks, so that all seme promotion could be done by using the
00441 pattern matcher (i.e. a forward/backward chainer).  That is, we want to
00442 minimize/eliminate the use of scheme code (or C++ code or python... or
00443 any code at all), and represent all graph transformations as hypergraphs
00444 themselves.
00445 
00446 
00447 Reference Resolution, and "Decorations"
00448 ---------------------------------------
00449 Consider the statement and truth-query below:
00450 
00451    "Ben violently threw the green ball."
00452    "Did Ben softly throw the green ball?"
00453 
00454 The problem of reference resolution is that of determining whether the
00455 "Ben" in the question is the same "Ben" as in the statement. Likewise
00456 for the verb "throw", since maybe there was some *other* ball that Ben
00457 did throw softly.
00458 
00459 The current code uses the notion of "decorations", and uses pattern
00460 matching against the decorations to determine whether different word
00461 instances might refer to the same seme.  
00462 
00463 Given some seme, its "decorations" are all of the relations that have
00464 that seme appearing in the head-position of the relation.  So, for 
00465 example, for the verb V == "throw", the word instance V is promoted 
00466 to the "seme" V by decorating V with relations.  In this example,  
00467 V is "decorated" with _subj(V, Ben), _obj(V, ball), _advmod(V, violently).
00468 
00469 Then later, when I see "Did Mike throw a ball?" the answer is no, 
00470 because this V is decorated in a different way.  That is, this instance
00471 of "throw" can't possibly be the same "throw" as in the earlier 
00472 sentence, because of the different decorations.
00473 
00474 "Did Ben throw a ball?" -- yes, because this V is decorated with a 
00475 subset of _subj(V, Ben), _obj(V, ball), _advmod(V, violently).
00476 
00477 "Did Ben softly throw the ball?"  No -- This instance of "throw" is
00478 decorated differently -- it can't possibly be the same "seme" as the 
00479 violent throw.
00480 
00481 Perhaps there is some other  "throw" in the system ... maybe there is
00482 another sentence ---  "Ben threw  a red ball softly" already in the
00483 system, in which case the "throw" seme does appear to be the same.  And
00484 also, the "ball" word-instance does match "red ball" so the answer is
00485 "yes".  And, as a bonus, we know which ball was referred to -- it had to
00486 be the red ball -- it's the only match.
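
The subset test used in these examples can be written down directly. The
sketch is plain Python and simplifies heavily: decorations are
(relation, lemma) pairs instead of links between semes, the modifiers on
"ball" are ignored, and the truth-query flags discussed in the next
section are left out.

```python
def matches(question, statement):
    """A question verb can map to a statement verb when every decoration
    on the question also decorates the statement."""
    return question <= statement


# "Ben violently threw the green ball."
violent_throw = {("_subj", "Ben"), ("_obj", "ball"), ("_advmod", "violently")}
# "Ben threw a red ball softly."
soft_throw = {("_subj", "Ben"), ("_obj", "ball"), ("_advmod", "softly")}

did_mike_throw = {("_subj", "Mike"), ("_obj", "ball")}
did_ben_throw = {("_subj", "Ben"), ("_obj", "ball")}
did_ben_softly_throw = {("_subj", "Ben"), ("_obj", "ball"), ("_advmod", "softly")}
```

Against violent_throw, only did_ben_throw matches; did_ben_softly_throw
matches once soft_throw is also in the system.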
00487 
00488 In the above, all "decorations" were in the form of RelEx relations. 
00489 At some point, these can be, and should be, replaced by bona-fide 
00490 "frames".  The need for this is already fairly clear, when one compares
00491 questions such as "Has Ben always been throwing balls?" to the 
00492 syntactically similar "Has that tree always been standing there?".
00493 Here, "throwing" is a dynamic, transient activity, while "standing"
00494 is not.  Decorating with RelEx relations is not enough to capture this
00495 difference.  Any sort of more sophisticated logical deduction or 
00496 inference will need access to such frame decorations.
00497 
00498 An important point here is that "bona-fide" frame relations can have
00499 the same structure as the RelEx relations: that is, both can still be
00500 considered to be "decorations", and so pattern matching and other
00501 algorithms can benefit from this structural similarity.
00502 
00503 
00504 Promotion and decoration of questions
00505 -------------------------------------
00506 The act of seme promotion, as described above, performs question
00507 answering more-or-less as a side-effect.  That is, the result of seme
00508 promotion on the sentence "Ben threw a ball" and the question "Did Ben
00509 throw a ball?" is a match on "Ben" and "ball", leaving only the verb
00510 to be compared. The primary algorithmic concern is to avoid accidentally
00511 promoting the question into a statement: i.e. to map both verbs to the 
00512 same seme. 
00513 
00514 RelEx decorates the verb "throw" in the question with 
00515 
00516    TRUTH-QUERY-FLAG(throw, T)
00517    HYP(throw, T)
00518 
00519 which is rendered as
00520 
00521    ; TRUTH-QUERY-FLAG (throw, T)
00522    (InheritanceLink (stv 1.0 1.0)
00523       (WordInstanceNode "throw@dcae1b05-54cf-4f8a-b650-cd7327fe6fb6")
00524       (DefinedLinguisticConceptNode "truth-query")
00525    )
00526    ; HYP (throw, T)
00527    (InheritanceLink (stv 1.0 1.0)
00528       (WordInstanceNode "throw@dcae1b05-54cf-4f8a-b650-cd7327fe6fb6")
00529       (DefinedLinguisticConceptNode "hyp")
00530    )
00531 
00532 Thus, the following verb promotion rules are needed:
00533 
00534 1) If a seme is decorated with HYP or TRUTH-QUERY, then a word instance
00535    without these decorations must not be promoted to that seme. This
00536    avoids the problem that would occur if the question "Did Ben throw a 
00537    rock?" was followed by the question "What did Ben throw?". The second
00538    "throw" does not have either HYP or TRUTH-QUERY, and so if promotion 
00539    was allowed, it would find "rock" as an answer.
00540 
00541 2) If a seme is NOT decorated with HYP or TRUTH-QUERY, then a word
00542    instance that does have these decorations must not be promoted to
00543    that seme.  This minimizes any chances that parts of the question
00544    might get promoted into statements, or that a statement might be 
00545    reverted into a question.
00546 
00547 3) If a word-instance is decorated with HYP or TRUTH-QUERY, then, when
00548    a seme is first being created, these decorations must be attached to
00549    the seme.
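
The three rules reduce to a small compatibility check. The sketch below
is plain Python with invented names (may_promote, new_seme_flags). It
reads "decorated with HYP or TRUTH-QUERY" as "carries at least one of the
two flags", which is an assumption about the intended meaning.

```python
QUERY_FLAGS = {"hyp", "truth-query"}


def may_promote(word_flags, seme_flags):
    """Rules 1 and 2: a word instance and a seme must agree on whether
    they carry any HYP / TRUTH-QUERY decoration."""
    return bool(word_flags & QUERY_FLAGS) == bool(seme_flags & QUERY_FLAGS)


def new_seme_flags(word_flags):
    """Rule 3: a freshly created seme inherits the query decorations."""
    return word_flags & QUERY_FLAGS
```

So the "throw" of "What did Ben throw?", which has no flags, may not be
promoted to the flagged seme created for "Did Ben throw a rock?".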
00550 
00551 
00552  
00553 
00554 Contextual representation
00555 -------------------------
00556 A given SemeNode can be relevant to one or more contexts. The 
00557 relationship is indicated with a link to a named context. Say, for 
00558 example, that during an IRC chat, JaredW stated that "The ball
00559 is red". We'd then have a 
00560 
00561     ContextLink
00562        ContextNode "# IRC:JaredW"
00563        SemeNode "ball@634a32ebc"
00564 
00565 At this time, the creation/naming of ContextNodes would be ad-hoc, on
00566 a case-by-case basis. All input from the MIT ConceptNet project would
00567 be marked with something like
00568 
00569     ContextNode "# MIT ConceptNet dump 20080605"
00570 
00571 A SemeNode, once determined to be sufficiently general, might belong
00572 to several contexts. There might be a hierarchy of ContextNode 
00573 inclusions: so, for example, concepts in "common-sense" contexts, such
00574 as ConceptNet, would be judged to be sufficiently universal to also hold
00575 in IRC contexts, Project Gutenberg contexts, Wikipedia contexts, etc.
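
One possible shape for such a hierarchy is an inclusion table that is
walked when looking up a seme. The sketch is plain Python; the table
INCLUDES and both function names are assumptions, and which contexts
include which is decided by hand here.

```python
# Hypothetical inclusion table: each context lists the broader contexts
# whose semes are taken to hold inside it as well.
INCLUDES = {
    "# IRC:JaredW": ["# MIT ConceptNet dump 20080605"],
    "# MIT ConceptNet dump 20080605": [],
}


def visible_contexts(context):
    """Return `context` plus every context reachable through INCLUDES."""
    seen, stack = [], [context]
    while stack:
        ctx = stack.pop()
        if ctx not in seen:
            seen.append(ctx)
            stack.extend(INCLUDES.get(ctx, []))
    return seen


def holds_in(seme_contexts, context):
    """True if a seme tagged with `seme_contexts` is usable in `context`."""
    return any(c in seme_contexts for c in visible_contexts(context))
```

A seme tagged only with the ConceptNet context is then usable in the IRC
context, but a seme tagged only with the IRC context is not usable in the
ConceptNet one.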
00576 
00577 The only reason for using ContextLinks instead of
00578 
00579      EvaluationLink
00580          DefinedRelationshipNode "Context"
00581          ListLink
00582              ContextNode "# IRC:JaredW"
00583              SemeNode "ball@634a32ebc"
00584 
00585 is to save a bit of RAM storage; there will be at least one ContextLink
00586 for every SemeNode.
00587 
00588 
00589 Bootstrapping
00590 -------------
00591 -- Read sentence from some source.
00592 -- Process sentence for triples
00593 -- Create initial SemeNodes that match key word instances.
00594 -- Add (temporary?) ContextLinks to indicate source.
00595 -- Scan existing SemeNodes for possible match.
00596 -- Merge SemeNodes if a plausible match is found. ("clustering")
00597 
00598 
00599 Current Implementation Status
00600 -----------------------------
00601 The current implementation does just about none of the whiz-bang stuff
00602 discussed above.  So far, just the most basic scaffolding has been set
00603 up.
00604 
00605 All "seme promoters" function by accepting a word-instance, and
00606 returning a corresponding seme. The various promoters are of
00607 different levels of sophistication, and use different algorithms.
00608 They all create an InheritanceLink relating the original word instance
00609 to the seme, like so:
00610 
00611    InheritanceLink
00612       WordInstanceNode hello@123
00613       SemeNode  greeting@789
00614 
00615 Most/all of these also create a link to the lemmatized word form
00616 i.e. "the English-language word" that corresponds to the seme:
00617 
00618    LemmaLink
00619       SemeNode  greeting@789
00620       WordNode  hello
00621 
00622 All of these promoters could be, and eventually should be
00623 re-implemented as ImplicationLinks, so that all promotion runs
00624 entirely within OpenCog.  For now, they are implemented in scheme.
00625 
00626 
00627 
00628 Deduction
00629 ---------
00630 Suppose we have a set of (consistent) prepositional relationships. What
00631 can we do with them?  For example, can we deduce that a certain verb is
00632 a type of locomotion, based on its use with regard to prepositions?
00633 
00634 Hmm. Time to write some rules, and experiment and see what happens.
00635 Not clear how unambiguous the copulas and preps will be.
00636 
00637 ToDo:
00638 -----
00639 Create the following new atom types:
00640 ContextNode
00641 SemanticRelationshipNode
00642 
00643 todo -- add POS tagging!!
00644 
00645 
00646 ToDo: Reification of triples ...
00647 Re-examine Markov Logic Networks ...
00648 Phrasal verbs vs. prepositional phrases
00649 
00650 
00651 Notes
00652 -----
00653 Initially 51 secs to load 3K sentences
00654 However, if sqldb  is open, this explodes to 5 minutes!  (I guess 
00655 because an sql query is made for each new atom added to the atomspace).
00656 (err, well my system is 100% cpu even before running this...)
00657 
00658 8   27  16 secs per sentence ... 
00659 
00660 bugs: $prep maps to Ss, also to _subj ... 
00661 
00662 Took 995 minutes to process 3870 sentences, used 466 MBytes
00663 Took 288 minutes to process 3000 sentences, used 284 MBytes
00664 From 3000 sentences, got 6978 triples
00665 But only 1475 unique triples on 1475 semes
00666 
00667 
00668 References:
00669 -----------
00670 [FEAT] Feature extraction. See
00671    http://en.wikipedia.org/wiki/Feature_extraction
00672    http://en.wikipedia.org/wiki/Cluster_analysis
00673 
00674 
00675 Alexander Yates and Oren Etzioni.
00676 [http://www.cis.temple.edu/~yates/papers/resolver-jair09.pdf
00677 Unsupervised Methods for Determining Object and Relation Synonyms on
00678 the Web]. Journal of Artificial Intelligence Research 34, March,
00679 2009, pages 255-296.
00680 
00681 Fabian M. Suchanek, Mauro Sozio, Gerhard Weikum
00682 SOFIE: A Self-Organizing Framework for Information Extraction
00683 WWW 2009 Madrid! 
00684 
