Talk:Automatic taxonomy construction

Topic notes[edit]

The following are reminders of topics to look into for potential coverage in the article.

  • Approaches
    • Pattern- or rule-based approach
    • Clustering-based or distributional approach
  • OntoLearn
  • OntoLearn Reloaded
  • NLP methods used in ATC
  • ATC algorithms
  • ATC PDFs
  • ATC projects
  • Synonyms (to assist users in looking up more about it on the Web)

Removing second Further Reading Item[edit]

I'm doing some editing on this article. The first thing I'm doing is cleaning up any obvious problems. The second Further Reading item doesn't go to a paper; it goes to a web site for a conference on federated systems and data mining. I'm going to delete it and any other items that are no longer relevant. If they are dead links I'll try the Wayback Machine, but the URL for the reading I'm deleting is just for the site of the conference. I'm not going to document every additional item I delete; I just wanted to post something here in case people want to discuss deletions or anything else about the article. One thing that jumps out at me is that the discussion of parents and children seems too colloquial to me and really misses the main point, which is logical subsumption: one class/set being a superset of another. I think I'll change that but will try to still say it in a way understandable to non-techies. If anyone has ideas or pointers to good articles please let me know. --MadScientistX11 (talk) 19:23, 7 March 2017 (UTC)[reply]

One minor correction: there was an actual URL for that item, but when I tried the Wayback Machine every archived version (there were only 3) went to a "page not found" and redirected. --MadScientistX11 (talk) 19:27, 7 March 2017 (UTC)[reply]
MadScientistX11 I agree that the subset/superset discussion could be made clearer. Two suggestions there - one is to make the distinction between subsets and instances (tokens vs types) more explicit. Labrador is a subclass of dog, but Fido is an instance. In some schemes (most?), that means using a different relation. In others, I imagine, it could all be handled with one "is-a" relation, with the distinction made implicitly by metadata about the items involved (both labrador and Fido "are" dogs, but labrador additionally is labelled as a type whereas Fido is labelled as a token). For example, I'm not certain if Wikipedia really has a notion of subcategories as distinct from categories that happen to be 'in' other categories in the same way as individual articles are 'in' them. The description of linguistic hyponymy seems to go that way too; it doesn't say explicitly that Fido is a hyponym of mammal, as dog is, but that was the impression I came away with after reading it. I'm happy to have a go at this part if it would be helpful.
The other suggestion is that the article mentions "taxonomies are often represented as is-a hierarchies" (later called an "is-a model"). Should it mention alternatives? What would the main alternatives be?
One other, separate suggestion - the paragraph that begins by explaining that taxonomy development is knowledge-intensive and potentially biased - if it's possible to find a reference for it, it would be good to include something to note the degree to which human judgement is still required to tidy up the output of automated methods or to bootstrap one, e.g. by putting in place the top-level abstract types, or by linking the subtrees that an automated process generated. That may be hard to find a reference for, though.
Sorry, one more - I'm interested now! It might be good to break this into sections a little. Perhaps the last two paragraphs could be split out into one (or two?) sections about 'applications' (or 'comparison to manual techniques' ('advantages') and 'example applications'). Just a thought. Mortee (talk) 20:01, 10 March 2017 (UTC)[reply]
Mortee Regarding the stuff about types vs. tokens: first, just to make sure we are on the same page, I would phrase this as subclasses vs. instances. Labrador is a subclass of dog and Fido is an instance... not saying either way is better, just want to make sure I'm understanding. Those distinctions are important for traditional is-a hierarchies in programming languages, but my impression (which could be wrong) is that they aren't significant in this linguistics-oriented work. What I found seemed to just talk about the hyponym vs. hypernym relations without distinguishing whether it was a subclass or instance relation. But it could be that I just didn't drill down deep enough into the papers. Anyway, have at it, I agree it's an important distinction and either way it would be good to be more explicit. I'm probably not going to be doing much more work on this for the time being so feel free. --MadScientistX11 (talk) 00:37, 14 March 2017 (UTC)[reply]
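(For illustration only: a minimal, hypothetical sketch in Python of how a taxonomy data structure might keep subclass links and instance links separate, rather than folding both into one is-a relation. None of the names come from the cited ATC systems.)

 # Hypothetical sketch: subclass links and instance links as separate relations.
 taxonomy = {
     "subclass_of": {"labrador": "dog", "dog": "mammal"},  # class -> parent class
     "instance_of": {"Fido": "labrador"},                  # individual -> class
 }
 def ancestors(term):
     """Walk subclass links upward; resolve an instance to its class first."""
     term = taxonomy["instance_of"].get(term, term)
     chain = []
     while term in taxonomy["subclass_of"]:
         term = taxonomy["subclass_of"][term]
         chain.append(term)
     return chain
 # ancestors("Fido") and ancestors("labrador") both give ["dog", "mammal"]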
Regarding the point about taxonomy development being knowledge intensive and potentially subjective, that is supported by one of the references. I usually follow the standard that if a reference supports multiple points in a section you can just leave one note at the end of the section. But I agree with you, it would be good to be explicit, especially since there are several notes together at the end and it's not clear which one supports those claims (which are definitely the kind of claims that need to be supported). I was actually going to do that but I got lazy. First, just to make sure we agree that it is supported, here is the text that I had in mind: "The value of automatic taxonomy constructing is obvious: Manual taxonomy construction is a laborious process, and the resulting taxonomy is often highly subjective, compared with taxonomies built by data-driven approaches. Furthermore, automatic approaches have the potential to enable humans or even machines to understand a highly focused and potentially fast changing domain." That's from the second paragraph of the Introduction section of Automatic Taxonomy Construction from Keywords. If you agree that it supports the point (or think the point should be altered to be more consistent with their wording) feel free to edit. Or if you prefer that I make the change, let me know. --MadScientistX11 (talk) 00:37, 14 March 2017 (UTC)[reply]
@MadScientistX11:
  • type/token, class/instance - sure, I agree. I'd put the extra detail about the distinction in the context of taxonomies, rather than linguistics. For the same reason, you're probably right about using class/instance rather than type/token. They're the same distinction, but I've encountered type/token more in philosophy and class/instance more in maths and computer science, so class/instance is probably the more appropriate phrasing for the article.
  • knowledge intensive - sorry I wasn't clear; I wasn't suggesting that the claim that it was knowledge-intensive needed more referencing. I agree it's supported already (and it's obvious); I was suggesting that it would be good to add a note about what type/amount of human work is still needed when dealing with an automatically constructed taxonomy, if we can find a reference for that.
Mortee (talk) 07:17, 14 March 2017 (UTC)[reply]
Thanks. We are in violent agreement :) I'll keep an eye out, and if I find anything that addresses that aspect of ATC I'll add the reference and more detail. --MadScientistX11 (talk) 15:04, 14 March 2017 (UTC)[reply]
One more point I realize I didn't respond to: you asked what the alternatives to is-a hierarchies are. First, the reason I wrote it that way was that the old article was pretty strong in saying they are always is-a models. I thought that sounded too strong, and in reading through the papers used as references I'm pretty sure at least one said it more or less the way I wrote it, often but not always. Also, the Wikipedia article Taxonomy (general) says something similar, that taxonomies are usually tree structured (it doesn't mention is-a, but that's my interpretation), although it doesn't mention alternatives either. This is not my strong point, but I think there are formalisms, sometimes called faceted classification, essentially keywords, where there are just a bunch of tags that can more or less be applied to any digital asset. Actually, even OWL kind of goes in that direction. They have is-a hierarchies of course, but they emphasize that properties, more than classes, are what's most important, and they discourage OO modelers like me from always wanting to define the domain and range of properties, because the model isn't primarily a programming model but a model for organizing federated, highly heterogeneous data. Also perhaps systems like the Dewey Decimal system? Some of that could be wrong but that's the best I can come up with, and I wanted to at least acknowledge that it's an excellent question, sorry I don't have a better answer. I've been sick and am working on an SBIR proposal and also going to UBC Vancouver next week and the week after, so I won't do much more in the next few weeks. I think all the ideas are good ones and encourage others to be wp:bold. I'll check back once I'm home and see if I can contribute anything more. Cheers --MadScientistX11 (talk) 05:38, 16 March 2017 (UTC)[reply]
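(Again just for illustration, a tiny hypothetical Python sketch of the tag-style/faceted alternative described above: no single is-a tree, just orthogonal facet tags attached to each item.)

 # Hypothetical sketch of faceted classification: orthogonal tags, no single tree.
 assets = {
     "paper_42": {"topic": "physics", "format": "pdf", "language": "en"},
     "paper_43": {"topic": "biology", "format": "html", "language": "en"},
 }
 def find(facet, value):
     """Return the items whose given facet carries the given value."""
     return [name for name, tags in assets.items() if tags.get(facet) == value]
 # find("format", "pdf") -> ["paper_42"]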

ATC programs don't have to be (and typically aren't) agents[edit]

The current article states: "ATC programs are examples of software agents and intelligent agents, and may be autonomous as well (see autonomous agent)." Nothing that I saw in any of the existing references, nor in any of the references I've found since, talks about using agents for ATC. Of course almost any kind of complex problem can be amenable to an agent approach, but from what I've read so far it's not common. The ATC systems seem to be pretty straightforward batch algorithms: you feed them a bunch of documents and they generate a taxonomy. I think the systems that provide the corpus may sometimes be web crawlers, which are agents, but typically the ATC program itself is not. I'm going to change this but wanted to document it before I do in case anyone disagrees and wants to discuss it. --MadScientistX11 (talk) 04:17, 8 March 2017 (UTC)[reply]
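(For illustration, a simplified, hypothetical Python sketch of that batch view; real ATC systems use far more sophisticated pattern-based and statistical methods. One common induction step is to harvest candidate hypernym-hyponym pairs with lexico-syntactic patterns such as "X such as Y".)

 import re
 # Hypothetical sketch of one batch induction step: scan a corpus once and
 # collect candidate (hypernym, hyponym) pairs from an "X such as Y" pattern.
 PATTERN = re.compile(r"(\w+) such as (\w+)")
 def extract_pairs(documents):
     """Return candidate (hypernym, hyponym) pairs found in a batch of documents."""
     pairs = set()
     for text in documents:
         for hypernym, hyponym in PATTERN.findall(text):
             pairs.add((hypernym.lower(), hyponym.lower()))
     return pairs
 corpus = ["Sciences such as physics rely on experiments.",
           "Dogs such as labradors make good pets."]
 # extract_pairs(corpus) -> {("sciences", "physics"), ("dogs", "labradors")}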

Finished re-write of article[edit]

I just rewrote the article. It's still a stub article, but I think it's now at least a fairly coherent stub with inline references. There is now significant overlap between the "Further Reading" section and the references. At a minimum I think we should delete anything in Further Reading that is used as a reference. The reference format gives the user more information than the link format used in Further Reading, and we risk pissing off users by having them click on links that are duplicates of references. Actually, IMO we should just completely delete the Further Reading section. I tend to be very conservative with Further Reading items. If something is useful enough to be in Further Reading, it should be useful enough to be used as a reference. Also, in my experience people tend to inflate those sections with their own papers or friends' papers. I think they should be reserved for the (rare) case where there is a well-known work on the topic which for some reason isn't used as a reference. But I'll leave it as is for now and see what others think. --MadScientistX11 (talk) 21:12, 8 March 2017 (UTC)[reply]

I cleaned up the Further Reading section and deleted any items that were already used as references (in one case the same paper was listed twice). Left the remaining ones. --MadScientistX11 (talk) 19:22, 9 March 2017 (UTC)[reply]

Version before rewrite (for comparison)[edit]

Automatic taxonomy construction (ATC) is the use of autonomous or semi-autonomous software programs to create hierarchical outlines or taxonomical classifications from a body of texts (corpus). It is a branch of natural language processing, which in turn is a branch of artificial intelligence. ATC programs are examples of software agents and intelligent agents, and may be autonomous as well (see autonomous agent).

Other names for ATC include taxonomy generation, taxonomy learning, taxonomy extraction, taxonomy building, and taxonomy induction. Any of these terms may be preceded by the word "automatic", as in automatic taxonomy induction. ATC is also referred to as semantic taxonomy induction.

A taxonomy is a tree structure and includes familial (parent-offspring, sibling, etc.) relationships built-in (like in a family tree). For example, physics is an offspring of physical science, which in turn is an offspring of science.

As mentioned above, the process is also called taxonomy induction. This is because, in order for a software program to construct a taxonomy from a corpus (for example, from Wikipedia, a web page, or the World Wide Web), it must induce which terms belong to the taxonomy and what the relationships between them are. Such as by identifying hyponym-hypernym pairs, among other approaches. This is done using algorithms, including statistical algorithms. Note that deduction (deductive logic) is often also employed (e.g., if B is a sibling of A, then B has the same parent as A and gets placed under that parent in the taxonomy).

A primary application of automatic taxonomy construction is in ontology learning, a central activity within ontology engineering. In computer science and artificial intelligence, an ontology is a conceptual model of a (subject) domain. A domain is a given subject area or specifically defined sphere of interest. An ontology of a domain includes the vocabulary of that domain and the relationships between those concepts or entities. The backbone of most ontologies is a taxonomy, and taxonomical structure may be used throughout an ontology.

As building taxonomies manually is extremely labor-intensive and time-consuming, there is great motivation to automate the process.

Here's the hard-link: https://en.wikipedia.org/w/index.php?title=Automatic_taxonomy_construction&oldid=769239706    — The Transhumanist   , posted 14:58, 9 March 2019

Thanks for including this and for the changes you made to clean up what I did. Nice working with you. --MadScientistX11 (talk) 00:09, 14 March 2017 (UTC)[reply]

Feedback on current state of article[edit]

The Transhumanist asked me for some general-reader feedback on this article, as I'm interested but not an expert in this area.

First, I think the "Other names" section breaks up the flow of reading (in general I don't like boring lists like this to be the first section of an article), so I would recommend moving this section to the end (as I recently did with the "Similar titles" section in Technical writer) or converting it to more succinct prose, like this: "Other names for automatic taxonomy construction include: (automatic) taxonomy generation, taxonomy learning, taxonomy extraction, taxonomy building, or (automatic or semantic) taxonomy induction."

I don't think the "Taxonomies" section is necessary. I would put the first two sentences of this section at the start of the second paragraph in the lead, omitting this section's third sentence. If a reader wants a history of taxonomies, she can click through to Taxonomy (general) and Taxonomy (biology).

I would put the "Approaches" and "Applications" sections before the "Hyponymy" section. It's not clear to me how important the discussion of hyponymy is: Is this really fundamental to understanding ATC, or is it just something that interests The Transhumanist? I can't tell, because there are no good inline citations to secondary or tertiary sources that I can check to see how experts on this subject contextualize this. (Yes, there is a list of external links in the "Further reading" section, but without full bibliographic information or annotation or inline citation it's very difficult for a non-expert to immediately evaluate which of these links would be most helpful.)

In general the article only gives a conceptual description of ATC but doesn't tell me anything about implementation: How is ATC implemented in terms of data models that an information-literate reader might already be familiar with? Biogeographist (talk) 16:14, 19 June 2019 (UTC)[reply]

@Biogeographist: I'm not the author of the hyponymy section, but I too find it relevant, as it provides some insight into how terms are gathered from a corpus for a taxonomy. Finding specific inline citations will take time, as will reading up more on implementation. Most of the rest of your suggestions have been applied to the article. Nice tips (all of 'em). Thank you.    — The Transhumanist   09:09, 21 June 2019 (UTC)[reply]

Example cleaned up article[edit]

An article that I cleaned up back in January 2018, Rhetorical structure theory, may provide an interesting comparison in regard to details of implementation, because that article is about the same length as this one (about 10kb for RST vs. about 7kb for ATC). Neither article is strong on details of implementation, but I get a better understanding of how RST is implemented in discourse parsing from that article than I get of how ATC is implemented from this article. This is mainly because there is a simple example in Rhetorical structure theory § Rhetorical relations and a graphic to aid understanding, and plenty of inline citations of reliable sources that I can consult for more information, even if the Wikipedia article itself is not as detailed as it could/should be. Biogeographist (talk) 17:00, 19 June 2019 (UTC)[reply]

@Biogeographist: Thank you, I'll take a look.    — The Transhumanist   09:09, 21 June 2019 (UTC)[reply]

Research vectors on taxonomy management[edit]

ATC is an activity that is part of the larger activity of taxonomy management, and it may involve the application of a wide assortment of NLP tools and techniques. To understand the context in which ATC is applied, and ATC's components, it may be necessary to study the field in which it is applied....