Project Name : Yahoo Article Scraper
  Project Task : Scrapes the articles published on the Yahoo news site
  Software Used : C++, Web-Harvest, JavaScript and XML
  Operating System : Linux and Windows XP SP2
  About the Project:

It is a desktop application that scrapes the updated articles published on the Yahoo website, based on each article's published date and time. Scraping starts once the starting-time, ending-time and delay-time have been given.
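A minimal sketch of how the three timing inputs could be handled, using the defaults listed in the flow (a 60-second delay-time, the current date-time as the start, and a single pass when no end is given). The names `Schedule`, `makeSchedule`, `decide` and `Action` are hypothetical and not taken from the project; waiting before the start date-time is an assumption, since the flow only says that work continues while the current time is inside the window.

```cpp
#include <chrono>
#include <optional>

using Clock = std::chrono::system_clock;
using TimePoint = Clock::time_point;

// Hypothetical settings holder; the field names are illustrative only.
struct Schedule {
    std::chrono::seconds delay{60};  // default delay-time: 60 seconds
    TimePoint start;                 // defaults to the current date-time
    std::optional<TimePoint> end;    // empty: process all stocks once, then exit
};

enum class Action { Wait, Run, Exit };

// Fill in the defaults for any input that was not given.
Schedule makeSchedule(std::optional<std::chrono::seconds> delay,
                      std::optional<TimePoint> start,
                      std::optional<TimePoint> end,
                      TimePoint now) {
    Schedule s;
    if (delay) s.delay = *delay;
    s.start = start.value_or(now);
    s.end = end;
    return s;
}

// Decide what to do at time `now`.
// Assumption: before the start date-time the application simply waits.
Action decide(const Schedule& s, TimePoint now) {
    if (s.end && now >= *s.end) return Action::Exit;  // end reached: quit
    if (now < s.start) return Action::Wait;           // not started yet
    return Action::Run;                               // inside the window
}
```

When no end date-time is supplied, `decide` never returns `Action::Exit`, so the caller would need to stop after one pass over the stocks on its own.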

This application has a control file that lists the stock names to be scraped. It searches for articles on each stock name and creates a stock file and an article file for each article found. After that it updates the control file (date and article title). For each publisher it creates a new folder and stores the article there with the name of the publisher and the date.
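As an illustration of the stock file mentioned above, the following sketch builds the file name (X + stock name + .DAT) and the fixed-width second line described in the flow: two blank characters, an 80-character title, a 20-character publisher and a 30-character article file name. The helper names `fitField`, `stockFileName` and `stockLine` are hypothetical. The date-time field is left out because the flow marks it as not done, and truncating over-long values is an assumption.

```cpp
#include <cstddef>
#include <string>

// Pad with spaces, or truncate, so the result is exactly `width` characters.
std::string fitField(const std::string& text, std::size_t width) {
    std::string out = text.substr(0, width);
    out.resize(width, ' ');
    return out;
}

// Stock file name: "X" + stock name + ".DAT", for example XIBM.DAT.
std::string stockFileName(const std::string& stock) {
    return "X" + stock + ".DAT";
}

// Second line of the stock file: 2 blank chars, then title (80),
// publisher (20) and article file name without extension (30).
std::string stockLine(const std::string& title,
                      const std::string& publisher,
                      const std::string& articleFile) {
    return std::string(2, ' ') + fitField(title, 80) +
           fitField(publisher, 20) + fitField(articleFile, 30);
}
```

With this layout every record line is 132 characters long, so a reader can pick the fields out again by position.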

 
  Flow for the project “Yahoo scraper”:

1. Get delay-time, start date-time and end date-time

2. • Check the start date-time and end date-time. If the current date-time is between them, continue the work. If the current time exceeds the end date-time, exit.

    • If the delay-time is not given, the default value is used. The default delay-time is 60 seconds.

    • If the starting date-time is not given, the default value is used. The default starting date-time is the current date-time.

    • If the ending date-time is not given, all the stocks are processed once and the application then exits.

3. Open control file

4. Copy the 1st line to a variable

5. Remove the white space from that variable and save it

6. Count the number of lines starting from the 2nd line to find how many stocks are present

7. Loop starts here (it runs until end-of-file)

8. Read the first 10 chars and store them in a variable

9. Remove the spaces and save it

10. Store the 11th and 12th chars in separate variables

11. The 13th to 28th chars are the date and time. Store them in a variable

12. The other chars are the article title (last 40 characters)

13. If the stock’s status is yes, check its date. If the stock doesn’t have a date, take the date given in the first line (which is stored in a variable)

14. Send the stock name, look-back-date and release-time to Web-Harvest

15. Check delay-time

16. Using the three inputs from C++, Web-Harvest finds the following: article content, article title, publisher and article date-time

17. Create a stock file

      X“STOCK NAME”.DAT

      Ex: XIBM.DAT

18. First line of the stock file is empty

19. 2nd line: 1st and 2nd chars are empty

20. Date and time of article (not done)

21. 2nd line: the next 80 chars are the article title

22. The next 20 chars are the article publisher

23. The next 30 chars are the article’s filename without extension

24. Create article file

      W“stock name”“Yr”“Month”“Date”“release-time”

      Ex: WIBM0810231428A.ART

25. First line is empty

26. The remaining lines are the text of the article

27. Update the control file (look-back-date, article title)

28. Check the article’s look-back-date

29. End of loop

30. Compare the current time with the end-time. If the current time is greater than or equal to the end-time, quit the process
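The control-file record handling in steps 8 to 13 can be sketched as a fixed-width parse. This is only an illustration under stated assumptions: the names `ControlRecord`, `removeSpaces`, `parseControlRecord` and `lookBackDate` are hypothetical, short lines are padded with spaces, and the meaning of the two single-character fields is not spelled out in the flow, so they are kept as plain flags.

```cpp
#include <cstddef>
#include <string>

// One stock line of the control file (hypothetical structure).
struct ControlRecord {
    std::string stock;     // chars 1-10 with spaces removed
    char flag1 = ' ';      // 11th char (meaning not given in the flow)
    char flag2 = ' ';      // 12th char (meaning not given in the flow)
    std::string dateTime;  // chars 13-28: date and time
    std::string title;     // remaining chars: article title
};

// Remove spaces and tabs from a string.
std::string removeSpaces(const std::string& s) {
    std::string out;
    for (char c : s) {
        if (c != ' ' && c != '\t') out += c;
    }
    return out;
}

// Split one control-file line into its fixed-width fields.
ControlRecord parseControlRecord(const std::string& line) {
    std::string padded = line;
    if (padded.size() < 28) padded.resize(28, ' ');  // assumption: pad short lines
    ControlRecord r;
    r.stock = removeSpaces(padded.substr(0, 10));
    r.flag1 = padded[10];
    r.flag2 = padded[11];
    r.dateTime = padded.substr(12, 16);
    r.title = padded.substr(28);
    return r;
}

// Step 13: use the stock's own date if present, otherwise the date
// taken from the first line of the control file.
std::string lookBackDate(const ControlRecord& r, const std::string& headerDate) {
    return removeSpaces(r.dateTime).empty() ? headerDate : r.dateTime;
}
```

Working from a padded copy of the line keeps the positional reads safe even when a stock has no date or title yet.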

QualityPoint Technologies
