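Attribute-Mediated Dependences gives the source of the Cora dataset:
http://www.cs.umass.edu/~mccallum/data/ (if the link does not open when copied, type it into the address bar by hand)
The paper A Pitfall and Solution in Multi-Class Feature Selection for Text Classification provided the inspiration: Cora has 6 major categories and 36 subcategories. With that, the relevance problem was finally solved.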
(a) cora-refs.tar.gz dataset
Cora Citation Matching [reference matching, object correspondence]
Text of citations hand-clustered into groups referring to the same paper.
(b) cora-ie.tar.gz dataset
Cora Information Extraction [information extraction]
Research paper headers and citations, with labeled segments for
authors, title, institutions, venue, date, page numbers and several
other fields.
(c) cora-classify.tar.gz dataset
Cora Research Paper Classification [relational document classification]
Research papers classified into a topic hierarchy with 73 leaves. We
call this a relational data set, because the citations provide
relations among papers.
(d) cora-hmm.tar.gz
Cora HMM is the C implementation of HMMs used for information
extraction in Cora. It was written by Kristie Seymore.
Cora readme
Note that in Cora there are two types of papers: those we found on the
Web, and those that are referenced in bibliography sections. It is
possible that a paper we found on the Web is also referenced by other
papers.
FILE SUMMARY:
* The file 'papers' contains limited information on the papers we found
on the Web.
* The file 'citations' contains the citation graph.
* The file 'classifications' contains class labels.
* The directory 'extractions' contains the extracted authors, title,
abstract, etc., plus the references (and in some cases surrounding
text), from the postscript papers we found on the Web.
PAPERS
The file 'papers' has a list of all the postscript file papers.
Three fields, tab separated:
<id> <filename> <citation string>
There are about 52000 lines in this file, but there are a bunch
of papers that have more than one postscript file. If you eliminate
lines with duplicate ids there are about 37000 papers. Note the
citation string is either (1) an arbitrary bibliography reference to
the paper, if one was made, or (2) a constructed entry based on the
authors and title extracted from the postscript file.
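The duplicate-id point above can be sketched in Python. This is only a minimal sketch, assuming each line really is `<id>`, `<filename>`, `<citation string>` separated by tabs; the function name `read_papers` and the three sample lines are made up for illustration and are not taken from the real file.

```python
def read_papers(lines):
    """Group tab-separated 'papers' lines by id.

    Returns {paper_id: [(filename, citation_string), ...]} so that a paper
    with several postscript files keeps all of them under one id.
    """
    papers = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip lines that do not have exactly three fields
        paper_id, filename, citation = parts
        papers.setdefault(paper_id, []).append((filename, citation))
    return papers


# Hypothetical sample lines (not real Cora data): id 1 has two postscript files.
sample_papers = [
    "1\thttp:##a.example#x.ps\tSmith. A paper.\n",
    "1\thttp:##b.example#x.ps.gz\tSmith. A paper.\n",
    "2\thttp:##c.example#y.ps\tJones. Another paper.\n",
]
papers = read_papers(sample_papers)
print(len(sample_papers), "lines,", len(papers), "distinct ids")
```

On the real file one would pass an open file handle instead of the sample list; the number of keys in the result is the count of papers after removing duplicate ids.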
CITATIONS
The file 'citations' has the citation graph. Two fields, tab
separated:
<referring_id> <cited_id>
The referring_id is the id of the paper that has the bibliography
section (always one we have postscript for). The cited_id is the
paper referenced (we may or may not have postscript for it). There
are about 715000 citations.
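A minimal sketch of loading this edge list into a graph, assuming one tab-separated `<referring_id>`, `<cited_id>` pair per line. The names `read_citations`, `cites`, and `cited_by` and the sample edges are illustrative assumptions only.

```python
from collections import defaultdict


def read_citations(lines):
    """Build forward and reverse adjacency sets from referring/cited id pairs."""
    cites = defaultdict(set)     # referring_id -> ids it cites
    cited_by = defaultdict(set)  # cited_id -> ids that cite it
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # ignore blank or malformed lines
        referring_id, cited_id = parts
        cites[referring_id].add(cited_id)
        cited_by[cited_id].add(referring_id)
    return cites, cited_by


# Hypothetical edges: papers 10 and 11 both cite 20; paper 10 also cites 21.
sample_edges = ["10\t20\n", "10\t21\n", "11\t20\n"]
cites, cited_by = read_citations(sample_edges)
print(sorted(cites["10"]), sorted(cited_by["20"]))
```

Keeping the reverse map makes it cheap to ask which papers cite a given paper, which is the relation a relational classifier would typically use.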
CITATIONS.WITHAUTHORS
The file 'citations.withauthors' contains another copy of the
citation graph. This time we have also included authors and file
names of each paper in addition to each paper's unique paper_id and
the paper_ids of the references it makes. The format of this file
is:
***
this_paper_id
filename
id_of_first_cited_paper
id_of_second_cited_paper
.
.
.
*
Author#1 (of this paper)
Author#2
.
.
.
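The record layout above can be read with a small state machine. This is a sketch under the assumption that `***` always opens a record, that the next two non-empty lines are the paper id and the filename, that the following lines up to a lone `*` are cited ids, and that everything after `*` is an author name. The function name and the sample record are hypothetical.

```python
def parse_withauthors(lines):
    """Parse 'citations.withauthors'-style records into a list of dicts."""
    records = []
    current = None
    in_authors = False
    for raw in lines:
        line = raw.strip()
        if line == "***":
            # A new record starts; the id and filename follow.
            current = {"paper_id": None, "filename": None,
                       "cited_ids": [], "authors": []}
            records.append(current)
            in_authors = False
        elif current is None or not line:
            continue  # skip anything before the first record, and blank lines
        elif line == "*":
            in_authors = True  # the author list starts after a lone '*'
        elif in_authors:
            current["authors"].append(line)
        elif current["paper_id"] is None:
            current["paper_id"] = line
        elif current["filename"] is None:
            current["filename"] = line
        else:
            current["cited_ids"].append(line)
    return records


# Hypothetical record, shaped like the format description.
sample_record = ["***", "10", "http:##a.example#x.ps", "20", "21",
                 "*", "A. Smith", "B. Jones"]
records = parse_withauthors(sample_record)
print(records[0]["paper_id"], records[0]["cited_ids"], records[0]["authors"])
```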
CLASSIFICATIONS
The file 'classifications' contains the research topic classifications
for each of the files. The format of the file is:
"filename"+"\t"+"classification". For example:
http:##www.ri.cmu.edu#afs#cs#user#alex#docs#idvl#dl97.ps /Information_Retrieval/Retrieval/
The file name is the URL the paper came from, translated to a filename
by changing / to #. The classification is the label name in the Cora
directory hierarchy.
Note that the class labels were not perfectly assigned.
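A minimal sketch of decoding one such line, assuming the separator between filename and label is a tab character and that every `#` in the filename stands for a `/` of the original URL (a URL that itself contained `#` could not be recovered this way). The function name `parse_classification` is hypothetical; the demo line is the readme's example with an explicit `\t` written in.

```python
def parse_classification(line):
    """Split a classification line into (url, list of topic levels)."""
    filename, label = line.rstrip("\n").split("\t", 1)
    url = filename.replace("#", "/")  # undo the / -> # translation
    topics = [part for part in label.strip().split("/") if part]
    return url, topics


example = ("http:##www.ri.cmu.edu#afs#cs#user#alex#docs#idvl#dl97.ps"
           "\t/Information_Retrieval/Retrieval/")
url, topics = parse_classification(example)
print(url)
print(topics)
```

Here `topics[0]` is the top-level category and the last element is the leaf, so counting distinct first elements and distinct full paths over the whole file is one way to check how many major categories and leaves the hierarchy has.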
EXTRACTIONS
The directory 'extractions' contains 52906 files, one for each
postscript paper that we found on the Web. The directory contains so
many files that you probably don't want to 'ls' it. Commands like
'find extractions -print' will probably work more efficiently.
Each filename in the 'papers' file should have a file here. I believe
there are also some (perhaps many?) extra files in this tarball that
are not in paper-data that you can just ignore.
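In the same spirit as the 'find' advice, a script can walk the directory lazily instead of listing it all at once. A minimal sketch using the standard library's `os.scandir`; the helper name `iter_extraction_files` is an assumption, and the demo runs on a throwaway temporary directory rather than the real 'extractions' directory.

```python
import os
import tempfile


def iter_extraction_files(directory):
    """Yield file paths one at a time without building a full listing first."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                yield entry.path


# Demo on a temporary directory standing in for 'extractions'.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("a.txt", "b.txt", "c.txt"):
        with open(os.path.join(demo_dir, name), "w") as handle:
            handle.write("placeholder\n")
    found = sorted(os.path.basename(p) for p in iter_extraction_files(demo_dir))
print(found)
```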
Each line of each file corresponds to some bit of data about the
postscript file. Most of the MIME-like field tags are
straightforward and self-explanatory. A few notes:
The fields URL, Refering-URL, and Root-URL are given by the spider.
All other fields are extracted automatically from the text, some by
hand-coded regular expressions and some by an HMM information
extractor.
The fields Abstract-found and Intro-found are binary-valued indicators
of whether Abstract and/or Introduction sections were found by some
regular expression matching in the paper.
Each Reference field is one bibliography entry found at the end of the
paper. Note they are marked up using SGML-like tags. Each Reference
field is optionally followed by one (and possibly more?)
Reference-context fields that are snippets of the postscript file
around where the reference was cited.
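A sketch of reading one extraction file, assuming "MIME-like" means each line looks like `Field-name: value`. That layout, the `1` used for a binary indicator, and the `<author>`/`<title>` tag names in the sample are all assumptions made for illustration, so check them against a real file first. Each Reference-context is attached to the Reference that precedes it.

```python
import re
from collections import defaultdict

# Assumed line layout: a field name, a colon, then the value.
FIELD_RE = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):\s*(.*)$")


def parse_extraction(lines):
    """Return (fields, references) for one extraction file."""
    fields = defaultdict(list)
    references = []
    for line in lines:
        match = FIELD_RE.match(line.rstrip("\n"))
        if not match:
            continue  # not a 'Name: value' line under the assumed layout
        name, value = match.groups()
        if name == "Reference":
            references.append({"text": value, "contexts": []})
        elif name == "Reference-context" and references:
            # A context belongs to the most recent Reference field.
            references[-1]["contexts"].append(value)
        else:
            fields[name].append(value)
    return dict(fields), references


# Hypothetical file content, not copied from the dataset.
sample_file = [
    "URL: http://example.org/paper.ps",
    "Abstract-found: 1",
    "Reference: <author>Smith</author> <title>A paper</title>",
    "Reference-context: as shown by Smith [1]",
    "Reference: <author>Jones</author> <title>Another paper</title>",
]
fields, references = parse_extraction(sample_file)
print(fields["URL"], len(references))
```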
Papers that cite the Cora dataset:
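God bless me. I could cry. After reading for so long, probably more than a hundred papers, I have finally, finally, finally found a suitable dataset. Every step has been hard, and now I really feel it. It is only because I lack experience and have used very few datasets that choosing one was so difficult. I hope I can figure Cora out properly so that the other brothers and sisters in the lab can use it easily later.
When I chatted with my senior labmate he laughed at me, haha. I actually understood what he meant and knew what I wanted; I was just frustrated because I had no way to define relevance, and sometimes, in desperation, I would get the sudden idea of labeling the data by hand. But manual labeling is too much trouble and not convincing. It is the same with their suggestion that I build my own data: that is pure manual labor, and there is no time for it. Many times I excitedly came up with a new idea and then lost heart the moment it was shot down. Yesterday I really could not take it anymore and started buying and selling things, which is about my only hobby and way to kill time. At least I kept a lip balm and a lip gloss that I like, so in effect I spent 150 on two items; one is a limited edition that I am happy to own even though I never use it. A while ago I sold all my skirts, and now I suddenly like skirts again. Ugh, it looks like I will be shopping again: at least three skirts, plus boots to match. I already have one pair, and by rights I need at least one more. Sigh, I am done for. Last week I bought a pair of shoes online for 128 and then found I had no clothes to go with them. I nearly fainted! They have been sitting there and I am too lazy to go exchange them, but seeing them really does make me unhappy. One of these days I will hide them: out of sight, out of mind. Hehe. Faint! I was supposed to be backing up the dataset, so how did I end up writing all this random stuff? Shame on me.
Back to real work, hehe. Benben's talent for rambling is absolutely first-rate, sigh. If only writing papers came as easily as rambling!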